Chapter 1
Introduction 



This compendium aims to provide a comprehensive overview of the main topics that appear
in any well-structured course sequence in statistics for business and economics at the
undergraduate and MBA levels. The idea is to supplement either formal or informal statistics
textbooks such as, e.g., "Basic Statistical Ideas for Managers" by D.K. Hildebrand and R.L.
Ott and "The Practice of Business Statistics: Using Data for Decisions" by D.S. Moore,
G.P. McCabe, W.M. Duckworth and S.L. Sclove, with a summary of theory as well as with
a couple of extra examples. In what follows, we set the road map for this compendium by
describing the main steps of statistical analysis.
Statistics for Business and Economics Introduction

Statistics is the science and art of making sense of both quantitative and qualitative data. 
Statistical thinking now dominates almost every field in science, including social sciences such 
as business, economics, management, and marketing. It is virtually impossible to avoid data 
analysis if we wish to monitor and improve the quality of products and processes within a 
business organization. This means that economists and managers have to deal almost daily 
with data gathering, management, and analysis. 



1.1 Gathering data 



Collecting data involves two key decisions. The first refers to what to measure. Unfortu- 
nately, it is not necessarily the case that the easiest-to-measure variable is the most relevant 
for the specific problem in hand. The second relates to how to obtain the data. Sometimes 
gathering data is costless, e.g., a simple matter of internet downloading. However, there are 
many situations in which one must take a more active approach and construct a data set 
from scratch. 

Data gathering normally involves either sampling or experimentation. Although the latter
is less common in social sciences, one should always have in mind that there is no need for a
lab to run an experiment. There is plenty of room for experimentation within organizations.
And we are not speaking exclusively about research and development. For instance, we could
envision a sales competition to test how salespeople react to different levels of performance
incentives. This is just one example of a key driver to improve quality of products and
processes.

Sampling is a much more natural approach in social sciences. It is easy to appreciate
that it is sometimes too costly, if not impossible, to gather universal data and hence it makes
sense to restrict attention to a representative sample of the population. For instance, while
census data are available only every 5 or 10 years due to the enormous cost/effort that it
involves, there are several household and business surveys at the annual, quarterly, monthly,
and sometimes even weekly frequency.
1.2 Data handling

Raw data are normally not very useful in that we must normally do some data manipulation 
before carrying out any piece of statistical analysis. Summarizing the data is the primary 
tool for this end. It allows us not only to assess how reliable the data are, but also to 
understand the main features of the data. Accordingly, it is the first step of any sensible 
data analysis. 

Summarizing data is not only about number crunching. Actually, the first task to trans- 
form numbers into valuable information is invariably to graphically represent the data. A 
couple of simple graphs do wonders in describing the most salient features of the data. For 
example, pie charts are essential to answer questions relating to proportions and fractions. 
For instance, the riskiness of a portfolio typically depends on how much investment there 
is in the risk-free asset relative to the overall investment in risky assets such as those in 
the equity, commodities, and bond markets. Similarly, it is paramount to map the source 
of problems resulting in a warranty claim so as to ensure that design and production man- 
agers focus their improvement efforts on the right components of the product or production 
process. 

The second step is to find the typical values of the data. It is important to know, for 
example, what is the average income of the households in a given residential neighborhood if 
you wish to open a high-end restaurant there. Averages are not sufficient though, for interest 
may sometimes lie on atypical values. It is very important to understand the probability 
of rare events in risk management. The insurance industry is much more concerned with 
extreme (rare) events than with averages. 

The next step is to examine the variation in the data. For instance, one of the main 
tenets of modern finance relates to the risk-return tradeoff, where we normally gauge the 
riskiness of a portfolio by looking at how much the returns vary in magnitude relative to 
their average value. In quality control, we may improve the process by raising the average
quality of the final product as well as by reducing the quality variability. Understanding
variability is also key to any statistical thinking in that it allows us to assess whether the 
variation we observe in the data is due to something other than random variation. 

The final step is to assess whether there is any abnormal pattern in the data. For instance,
it is interesting to examine not only whether the data are symmetric around some value but
also how likely it is to observe unusually high values that are relatively distant from the bulk
of data.

1.3 Probability and statistical inference 



It is very difficult to get data for the whole population. It is very often the case that it is 
too costly to gather a complete data set about a subset of characteristics in a population, 
either because of economic reasons or because of the computational burden. For instance, it 
is impossible for a firm that produces millions and millions of nails every day to check each 
one of their nails for quality control. This means that, in most instances, we will have to 
examine data coming from a sample of the population.

As a sample is just a glimpse of the entire population, it introduces some degree of
uncertainty into the statistical problem. To ensure that we are able to deal with this
uncertainty, it is very important to sample the data from its population in a random manner;
otherwise some sort of selection bias might arise in the resulting data sample. For instance,
if you wish to assess the performance of the hedge fund industry, it does not suffice to collect
data about living hedge funds. We must also collect data on extinct funds, for otherwise our
database will be biased towards successful hedge funds. This sort of selection bias is also
known as survivorship bias.

The random nature of a sample is what makes data variability so important. Probability
theory essentially aims to study how this sampling variation affects statistical inference,
improving our understanding of how reliable our inference is. In addition, inference theory is
one of the main quality-control tools in that it allows us to assess whether a salient pattern
in data is indeed genuine beyond reasonable random variation. For instance, some equity
fund managers boast of positive returns for a number of consecutive periods as if this
would entail irrefutable evidence of genuine stock-picking ability. However, in a universe of
thousands and thousands of equity funds, it is more than natural that, due to sheer luck,
a few will enjoy several periods of positive returns even if the stock returns are symmetric
around zero, taking positive and negative values with equal likelihood.
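
To put a rough number on the role of luck, the following sketch computes how many funds
one would expect to show an unbroken run of positive returns by chance alone. The universe
of 5,000 funds, the run of 8 periods, and the independence of returns across periods are
illustrative assumptions, not figures from the text.

```python
# Illustrative assumptions (not from the text): 5,000 funds, each with an
# independent 1/2 chance of a positive return in any given period.
def expected_lucky_funds(n_funds: int, n_periods: int, p_positive: float = 0.5) -> float:
    """Expected number of funds with a positive return in every one of n_periods."""
    return n_funds * p_positive ** n_periods


lucky = expected_lucky_funds(n_funds=5000, n_periods=8)
print(f"Funds expected to show 8 straight positive periods by luck alone: {lucky:.1f}")
```

Under these assumptions about twenty funds would display such a streak without any
stock-picking ability at all.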
Chapter 2
Data description 



The first step of data analysis is to summarize the data by drawing plots and charts as well 
as by computing some descriptive statistics. These tools essentially aim to provide a better 
understanding of how frequent the distinct data values are, and of how much variability 
there is around a typical value in the data. 

2.1 Data distribution 

It is well known that a picture tells more than a million words. The same applies to any 
serious data analysis for graphs are certainly among the best and most convenient data 
descriptors. We start with a very simple, though extremely useful, type of data plot that 
reveals the frequency at which any given data value (or interval) appears in the sample. A 
frequency table reports the number of times that a given observation occurs or, if based 
on relative terms, the frequency of that value divided by the number of observations in the 
sample. 

Example A firm in the transformation industry classifies the individuals at managerial 
positions according to their university degree. There are currently 1 accountant, 3 adminis- 
trators, 4 economists, 7 engineers, 2 lawyers, and 1 physicist. The corresponding frequency 
table is as follows.

degree              accounting  business  economics  engineering  law   physics
value               1           2         3          4            5     6
counts              1           3         4          7            2     1
relative frequency  1/18        1/6       2/9        7/18         1/9   1/18

Note that the degree subject that a manager holds is of a qualitative nature, and so it is not 
particularly meaningful if one associates a number to each one of these degrees. The above 
table does so in the row reading 'value' according to the alphabetical order, for instance. 
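
As a minimal sketch, the counts and relative frequencies of the managers' degrees in the
example above can be tabulated as follows. The variable names are hypothetical, and the
category labels are those used in the table.

```python
from collections import Counter
from fractions import Fraction

# Degrees of the 18 managers in the example above.
degrees = (["accounting"] * 1 + ["business"] * 3 + ["economics"] * 4
           + ["engineering"] * 7 + ["law"] * 2 + ["physics"] * 1)

counts = Counter(degrees)
n = len(degrees)
# Relative frequency = count divided by the number of observations.
relative = {degree: Fraction(count, n) for degree, count in counts.items()}

for degree in sorted(counts):
    print(f"{degree:<12} {counts[degree]:>2} {str(relative[degree]):>5}")
```

Using exact fractions keeps the relative frequencies in the same form as the table and
makes it easy to check that they add up to one.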

The corresponding plot for this type of categorical data is the bar chart. Figure 2.1 plots 
a bar chart using the degrees data in the above example. This is the easiest way to identify 
particular shapes of the distribution of values, especially concerning data dispersion. Least 
data concentration occurs if the envelope of the bars forms a rectangle in that every data 
value appears at approximately the same frequency.

In statistical quality control, one very often employs bar charts to illustrate the reasons
for quality failures (in order of importance, i.e., frequency). These bar charts (also known 
as Pareto charts in this particular case) are indeed very popular for highlighting the natural 
focus points for quality improvement. 

Bar charts are clearly designed to describe the distribution of categorical data. In a similar 
vein, histograms are the easiest graphical tool for assessing the distribution of quantitative 
data. It is often the case that one must first group the data into intervals before plotting a 
histogram. In contrast to bar charts, histogram bins are contiguous, respecting some sort of 
scale.

Figure 2.1: Bar chart of managers' degree subjects



2.2 Typical values 



There are three popular measures of central tendency: mode, mean, and median. The mode 
refers to the most frequent observation in the sample. If a variable may take a large number 
of values, it is then convenient to group the data into intervals. In this instance, we define the
mode as the midpoint of the most frequent interval. Even though the mode is a very intuitive
measure of central tendency, it is very sensitive to changes, even if only marginal, in data
values or in the interval definition. The mean is the most commonly used type of average
and so it is often referred to simply as the average. The mean of a set of numbers is the sum
of all of the elements in the set divided by the number of elements: i.e., $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. If
the set is a statistical population, then we call it a population mean or expected value. If the 
data set is a sample of the population, we call the resulting statistic a sample mean. Finally, 
we define the median as the number separating the higher half of a sample/population from
the lower half. We can compute the median of a finite set of numbers by sorting all the
observations from lowest value to highest value and picking the middle one (or, if the number
of observations is even, averaging the two middle ones).

Example Consider a sample of MBA graduates, whose first salaries (in $1,000 per annum) 
after graduating were as follows. 

75 86 86 87 89 95 95 95 95 95 

96 96 96 97 97 97 97 98 98 99 

99 99 99 100 100 100 105 110 110 110 

115 120 122 125 132 135 140 150 150 160 

165 170 172 175 185 190 200 250 250 300 

The mean salary is about $126,140 per annum, whereas the median figure is exactly $100,000
and the mode amounts to $95,000. Now, if one groups the data into 8 evenly distributed
bins between the minimum and maximum values, both the median and mode converge to the
same value of about $91,000 (i.e., the midpoint of the second bin).
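
The ungrouped figures quoted above can be reproduced with Python's standard library, as in
the sketch below. The variable names are hypothetical, and the grouped (binned) values are
not recomputed here because they depend on the binning convention.

```python
import statistics

# First salaries (in $1,000 per annum) from the example above.
salaries = [
    75, 86, 86, 87, 89, 95, 95, 95, 95, 95,
    96, 96, 96, 97, 97, 97, 97, 98, 98, 99,
    99, 99, 99, 100, 100, 100, 105, 110, 110, 110,
    115, 120, 122, 125, 132, 135, 140, 150, 150, 160,
    165, 170, 172, 175, 185, 190, 200, 250, 250, 300,
]

mean = statistics.mean(salaries)      # sum of the values divided by their number
median = statistics.median(salaries)  # average of the two middle values (n is even)
mode = statistics.mode(salaries)      # most frequent value

print(f"mean = {mean:.2f}, median = {median}, mode = {mode}")
```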

The mean value plays a major role in statistics. Although the median has several ad- 
vantages over the mean, the latter is easier to manipulate for it involves a simple linear 
combination of the data rather than a non-differentiable function of the data as the median. 
In statistical quality control, for instance, it is very common to display a means chart (also 
known as x-bar chart), which essentially plots the mean of a variable through time. We
say that a process is in statistical control if the means vary randomly but in a stable
fashion, whereas it is out of statistical control if the plot shows either a dramatic variation or
systematic changes.

2.3 Measures of dispersion 

While measures of central tendency are useful to understand what are the typical values 
of the data, measures of dispersion are important to describe the scatter of the data or, 
equivalently, data variability with respect to the central tendency. Two distinct samples 
may have the same mean or median, but different levels of variability, or vice-versa. A
proper description of a data set should always include both of these characteristics. There are
various measures of dispersion, each with its own set of advantages and disadvantages. 

We first define the sample range as the difference between the largest and smallest values 
in the sample. This is one of the simplest measures of variability to calculate. However, it 
depends only on the most extreme values of the sample, and hence it is very sensitive to 
outliers and atypical observations. In addition, it also provides no information whatsoever 
about the distribution of the remaining data points. To circumvent this problem, we may 
think of computing the interquartile range by taking the difference between the third and first 
quartiles of the distribution (i.e., subtracting the 25th percentile from the 75th percentile). 
This is not only a pretty good indicator of the spread in the center region of the data, but 
it is also much more resistant to extreme values than the sample range. 

We now turn our attention to the median absolute deviation, which renders a more
comprehensive alternative to the interquartile range by incorporating at least partially the
information from all data points in the sample. We compute the median absolute deviation
by means of $\mathrm{md}\,|X_i - \mathrm{md}(X)|$, where $\mathrm{md}(\cdot)$ denotes the median operator, yielding a
measure of dispersion that is very robust to aberrant values in the sample. Finally, the most
popular measure of dispersion is the sample standard deviation, defined as the square root of
the sample variance: i.e., $s_n = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2}$, where $\bar{X}_n$ is the sample mean.
The main advantage of variance-based measures of dispersion is that they are functions of
a sample mean. In particular, the sample variance is (up to the factor $n/(n-1)$) the sample
mean of the square of the deviations relative to the sample mean.

Example Consider the sample of MBA graduates from the previous example. The
variance of their first salary after graduating is about 2,288,400,000 (in squared dollars),
whereas the standard deviation is $47,837. The range is much larger, amounting to
300,000 - 75,000 = 225,000 per annum. The huge difference between these two measures of
dispersion suggests the presence of extreme values in the data. The fact that the
interquartile range is (150,000 + 150,000)/2 - (96,000 + 96,000)/2 = 54,000, and hence closer
to the standard deviation, seems to corroborate this interpretation. Finally, the median
absolute deviation of the sample is only about 10,000, indicating that the aberrant values of
the sample are among the largest (rather than smallest) values.
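
The dispersion measures for the MBA salary sample can be checked with the standard library,
as sketched below. Two conventions are assumptions on our part: quartiles follow the default
("exclusive") method of `statistics.quantiles`, and the variance divides by n - 1. Other
conventions give slightly different numbers; for instance, averaging the two middle absolute
deviations yields 10.5 (in $1,000), in line with the figure of about 10,000 quoted above.

```python
import statistics

# First salaries (in $1,000 per annum) of the 50 MBA graduates.
salaries = [
    75, 86, 86, 87, 89, 95, 95, 95, 95, 95,
    96, 96, 96, 97, 97, 97, 97, 98, 98, 99,
    99, 99, 99, 100, 100, 100, 105, 110, 110, 110,
    115, 120, 122, 125, 132, 135, 140, 150, 150, 160,
    165, 170, 172, 175, 185, 190, 200, 250, 250, 300,
]

data_range = max(salaries) - min(salaries)

# Quartiles with the default "exclusive" method (an assumed convention).
q1, _, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1

# Median absolute deviation: median of |X_i - md(X)|.
med = statistics.median(salaries)
mad = statistics.median([abs(x - med) for x in salaries])

variance = statistics.variance(salaries)  # divides by n - 1
std_dev = statistics.stdev(salaries)      # square root of the sample variance

print(f"range = {data_range}, IQR = {iqr}, MAD = {mad}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.3f}")
```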

In statistical quality control, it is also useful to plot some measures of dispersion over
time. The most common are the R and S charts, which respectively depict how the range
and the standard deviation vary over time. The standard deviation is also informative in a
means chart, for the interval [mean value ± two standard deviations] contains about 95% of
the data if their histogram is approximately bell-shaped (symmetric with a single peak). An
alternative is to plot control limits at the mean value ± three standard deviations, which
should include virtually all of the data (about 99.7% if the histogram is bell-shaped). These
procedures are very useful in that they reduce the likelihood that a manager goes
firefighting every short-term variation in the means chart.
Only variations that are very likely to reflect something out of control will fall outside the
control limits.

A well-designed statistical quality-control system should take both means and dispersion 
charts into account for it is possible to improve on quality by reducing variability and/or 
by increasing average quality. For instance, a chef that reduces cooking time on average by 
5 minutes, with 90% of the dishes arriving 10 minutes earlier and 10% arriving 40 minutes 
later, will probably not make the owner of the restaurant very happy. 
Chapter 3

Basic principles of probability 

3.1 Set theory 

There are two fundamental sets, namely, the universe $U$ and the empty set $\emptyset$. We say they
are fundamental because $\emptyset \subseteq A \subseteq U$ for every set A.

Taking the difference between sets A and B yields a set whose elements are in A but
not in B: $A - B = \{x \mid x \in A \text{ and } x \notin B\}$. Note that $A - B$ is not necessarily the
same as $B - A$. The union of A and B results in a set whose elements are in A or in B:
$A \cup B = \{x \mid x \in A \text{ or } x \in B\}$. Naturally, if an element x belongs to both A and B, then it is
also in the union $A \cup B$. In turn, the intersection of A and B individuates only the elements
that both sets share in common: $A \cap B = \{x \mid x \in A \text{ and } x \in B\}$. Last but not least, the
complement $\bar{A}$ of A defines a set with all elements in the universe that are not in A, that is
to say, $\bar{A} = U - A = \{x \mid x \notin A\}$.

Example Suppose that you roll a die and take note of the resulting value. The universe 
is the set with all possible values, namely, U = {1,2,3,4,5,6}. Consider the following two 
sets: $A = \{1, 2, 3, 4\}$ and $B = \{2, 4, 6\}$. It then follows that $A - B = \{1, 3\}$, $B - A = \{6\}$,
$A \cup B = \{1, 2, 3, 4, 6\}$, and $A \cap B = \{2, 4\}$.
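
The die-roll example maps directly onto Python's built-in set type, as the following sketch
shows. The variable names are hypothetical, and the complement is taken relative to the
universe U.

```python
# Die-roll example: the universe and the two sets A and B from the text.
U = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3, 4}
B = {2, 4, 6}

difference_ab = A - B   # elements in A but not in B
difference_ba = B - A   # elements in B but not in A
union = A | B           # elements in A or in B
intersection = A & B    # elements in both A and B
complement_a = U - A    # elements of the universe that are not in A

print(difference_ab, difference_ba, union, intersection, complement_a)
```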

If A and B are complementing sets, i.e., $A = \bar{B}$, then $A - B = A$, $B - A = B$, $A \cup B = U$,
and $A \cap B = \emptyset$. Figure 3.1 illustrates how one may represent sets using a Venn diagram.

Figure 3.1: Venn diagram representing sets A (oval in blue and purple) and B (oval in red
and purple) within the universe (rectangle box). The intersection $A \cap B$ of A and B is in
purple, whereas the overall area in color (i.e., red, blue, and purple) corresponds to the union
set $A \cup B$. The complement of A consists of the areas in grey and red, whereas the areas in
grey and blue define the complement of B.

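Properties The union and intersection operators are symmetric (commutative) in that
$A \cup B = B \cup A$ and $A \cap B = B \cap A$. They are also associative in that
$(A \cup B) \cup C = A \cup (B \cup C)$ and $(A \cap B) \cap C = A \cap (B \cap C)$.

From the above properties, it is straightforward to show that the following identities hold:
(I1) $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$, (I2) $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$, (I3) $A \cap \emptyset = \emptyset$,
(I4) $A \cup \emptyset = A$, (I5) $\overline{A \cap B} = \bar{A} \cup \bar{B}$, (I6) $\overline{A \cup B} = \bar{A} \cap \bar{B}$, and (I7) $\bar{\bar{A}} = A$.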
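
As a sanity check rather than a proof, the sketch below verifies the two distributive laws,
the two identities involving the empty set, De Morgan's laws, and the double-complement rule
for every triple of subsets of a six-element universe. The helper names are hypothetical.

```python
from itertools import combinations

U = frozenset({1, 2, 3, 4, 5, 6})
EMPTY = frozenset()


def complement(s):
    """Complement relative to the universe U."""
    return U - s


def all_subsets(universe):
    """Every subset of the universe, from the empty set up to U itself."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def identities_hold(a, b, c):
    return (
        a | (b & c) == (a | b) & (a | c)                    # distributive law
        and a & (b | c) == (a & b) | (a & c)                # distributive law
        and a & EMPTY == EMPTY
        and a | EMPTY == a
        and complement(a & b) == complement(a) | complement(b)  # De Morgan
        and complement(a | b) == complement(a) & complement(b)  # De Morgan
        and complement(complement(a)) == a
    )


subsets = all_subsets(U)
all_ok = all(identities_hold(a, b, c)
             for a in subsets for b in subsets for c in subsets)
print(f"Identities hold for all {len(subsets) ** 3} triples: {all_ok}")
```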

3.2 From set theory to probability 

The probability counterpart for the universe in set theory is the sample space S. Similarly,
probability theory focuses on events, which are subsets of possible outcomes in the sample space.



Example Suppose we wish to compute the probability of getting an even value in a die
roll. The sample space is the universe of possible outcomes S = {1, 2, 3, 4, 5, 6}, whereas the
event of interest corresponds to the set {2, 4, 6}.

To combine events, we employ the same rules as for sets. Accordingly, the event $A \cup B$
occurs if and only if we observe an outcome that belongs to A or to B, whereas the event
$A \cap B$ occurs if and only if both A and B happen. It is also straightforward to combine more
than two events in that $\bigcup_{i=1}^{n} A_i$ occurs if and only if at least one of the events $A_i$ happens,
whereas $\bigcap_{i=1}^{n} A_i$ holds if and only if every event $A_i$ occurs for $i = 1, \ldots, n$. In the same vein,
the event $\bar{A}$ occurs if and only if we do not observe any outcome that belongs to the event
A. Finally, we say that two events are mutually exclusive if $A \cap B = \emptyset$, that is to say, they
never occur at the same time. Mutually exclusive events are analogous to mutually exclusive
sets in that their intersection is empty.

3.2.1 Relative frequency

Suppose we repeat a given experiment n times and count how many times, say $n_A$ and
$n_B$, the events A and B occur, respectively. It then follows that the relative frequency of
event A is $f_A = n_A/n$, whereas it is $f_B = n_B/n$ for event B. In addition, if events A and
B are mutually exclusive (i.e., $A \cap B = \emptyset$), then the relative frequency of $C = A \cup B$ is
$f_C = (n_A + n_B)/n = f_A + f_B$.

The relative frequency of any event is always between zero and one. Zero corresponds
to an event that never occurs, whereas a relative frequency of one means that we always
observe that particular event. The relative frequency is very important, for the fundamental
law of statistics (also known as the Glivenko-Cantelli theorem) says that, as the number of
experiments n grows to infinity, it converges to the probability of the event: $f_A \to \Pr(A)$.
Chapter 5 discusses this convergence in more detail.
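
A small simulation illustrates this convergence for the event "even value in a die roll",
whose probability is 1/2. This is only a sketch: the function name, the seed, and the numbers
of rolls are arbitrary choices, and the exact frequencies depend on the random number
generator.

```python
import random


def relative_frequency_even(n_rolls: int, seed: int = 0) -> float:
    """Relative frequency of an even value in n_rolls simulated fair die rolls."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    n_even = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) % 2 == 0)
    return n_even / n_rolls


for n in (10, 100, 1_000, 100_000):
    print(f"n = {n:>7}: relative frequency = {relative_frequency_even(n):.4f}")
```

The relative frequency may be far from 1/2 for small n, but it settles close to 1/2 as the
number of rolls grows.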

Example The Glivenko-Cantelli theorem is the principle underlying many sport
competitions. The NBA play-offs are a good example. To ensure that the team with the best odds
succeeds, the playoffs are such that a team must win a given number of games against the
same adversary before qualifying for the next round.

3.2.2 Event probability 

It now remains to define exactly what we mean by the notion of probability. We associate
a real number to the probability of observing the event A, denoted by Pr(A), satisfying the
following properties:

P1 $0 \le \Pr(A) \le 1$;

P2 $\Pr(S) = 1$;

P3 $\Pr(A \cup B) = \Pr(A) + \Pr(B)$ if $A \cap B = \emptyset$;

P4 $\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} \Pr(A_i)$ if the collection of events $\{A_i, i = 1, \ldots, n\}$ is pairwise
mutually exclusive, even if $n \to \infty$.

It is easy to see that P4 follows immediately from P3 if we restrict attention to a finite
number of events ($n < \infty$). From properties P1 to P4, it is possible to derive some
important results concerning the different ways we may combine events.

Result It follows from P1 to P4 that

(a) $\Pr(\emptyset) = 0$,

(b) $\Pr(\bar{A}) = 1 - \Pr(A)$,

(c) $\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$, and

(d) $\Pr(A) \le \Pr(B)$ if $A \subseteq B$.

Proof: (a) By definition, the probability of event A is the same as the probability of the
union of A and $\emptyset$, viz. $\Pr(A) = \Pr(A \cup \emptyset)$. However, A and $\emptyset$ are mutually exclusive events
in that $A \cap \emptyset = \emptyset$, implying that $\Pr(A) = \Pr(A) + \Pr(\emptyset)$ by P3, and hence $\Pr(\emptyset) = 0$. (b) By
definition, $A \cup \bar{A} = S$ and $A \cap \bar{A} = \emptyset$, and so $\Pr(S) = \Pr(A \cup \bar{A}) = \Pr(A) + \Pr(\bar{A}) = 1$ by P2 and P3. (c) It
is straightforward to observe that $A \cup B = A \cup (B \cap \bar{A})$ and that $A \cap (B \cap \bar{A}) = \emptyset$, for the
event within parentheses consists of all outcomes in B that are not in A. It thus ensues that
$\Pr(A \cup B) = \Pr\left(A \cup (B \cap \bar{A})\right) = \Pr(A) + \Pr(B \cap \bar{A})$. We now decompose the event B into
outcomes that belong and do not belong to A: $B = (A \cap B) \cup (B \cap \bar{A})$. There is no intersection
between these two terms, hence $\Pr(B) - \Pr(A \cap B) = \Pr(B \cap \bar{A})$, yielding the result. (d) The
previous decomposition reduces to $B = A \cup (B \cap \bar{A})$ given that $A \cap B = A$. It then follows
that $\Pr(B) = \Pr(A) + \Pr(B \cap \bar{A}) \ge \Pr(A)$ in view that any probability is nonnegative. ■
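
Results (a) to (d) can be checked numerically on a concrete case. The sketch below assumes a
fair die, so that every outcome has probability 1/6; the events A, B, and C are hypothetical
choices made only for illustration.

```python
from fractions import Fraction

S = frozenset(range(1, 7))  # sample space of a fair die (assumed equiprobable)


def pr(event):
    """Probability of an event under equiprobable outcomes."""
    return Fraction(len(event & S), len(S))


A = frozenset({2, 4})
B = frozenset({2, 4, 6})      # note that A is a subset of B
C = frozenset({1, 2, 3, 4})

result_a = pr(frozenset()) == 0                       # (a) Pr(empty set) = 0
result_b = pr(S - A) == 1 - pr(A)                     # (b) complement rule
result_c = pr(B | C) == pr(B) + pr(C) - pr(B & C)     # (c) union rule
result_d = A <= B and pr(A) <= pr(B)                  # (d) monotonicity

print(result_a, result_b, result_c, result_d)
```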

3.2.3 Finite sample space 

A finite sample space must have only a finite number of elements, say, $\{a_1, a_2, \ldots, a_n\}$. Let
$p_j$ denote the probability of observing the corresponding event $\{a_j\}$, for $j = 1, \ldots, n$. It is
easy to appreciate that $0 \le p_j \le 1$ for all $j = 1, \ldots, n$ and that $\sum_{j=1}^{n} p_j = 1$ given that the
events $(a_1, \ldots, a_n)$ span the whole sample space. As the latter are also mutually exclusive,
it follows that $\Pr(A) = p_{j_1} + \cdots + p_{j_k} = \sum_{r=1}^{k} p_{j_r}$ for $A = \{a_{j_1}, \ldots, a_{j_k}\}$, with $1 \le k \le n$.

Example: The sample space corresponding to the value we obtain by throwing a die is
{1, 2, 3, 4, 5, 6} and the probability $p_j$ of observing any value $j \in \{1, \ldots, 6\}$ is equal to 1/6.

In general, if every element in the sample space is equiprobable, then the probability of 
observing a given event is equal to the ratio between the number of elements in the event 
and the number of elements in the sample space. 

Examples 

(1) Suppose interest lies in the event of observing a value above 4 in a die throw. There
are only two values in the sample space that satisfy this condition, namely, {5, 6}, and hence
the probability of this event is 2/6 = 1/3.

(2) Consider now flipping a coin twice and recording the heads and tails. The resulting
sample space is {HH, HT, TH, TT}. As the elements of the sample space are equiprobable,
the probability of observing only one head is $\#\{HT, TH\} / \#\{HH, HT, TH, TT\} = 2/4 = 1/2$.
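
Both examples amount to counting favourable outcomes and dividing by the size of the sample
space. A minimal sketch, with hypothetical variable names, is:

```python
from fractions import Fraction
from itertools import product

# Example (1): a value above 4 in a die throw.
die = [1, 2, 3, 4, 5, 6]
above_four = [value for value in die if value > 4]
prob_above_four = Fraction(len(above_four), len(die))

# Example (2): exactly one head in two coin flips.
sample_space = ["".join(flips) for flips in product("HT", repeat=2)]
one_head = [outcome for outcome in sample_space if outcome.count("H") == 1]
prob_one_head = Fraction(len(one_head), len(sample_space))

print(prob_above_four, sample_space, prob_one_head)
```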

These examples suggest that the most straightforward manner to compute the proba- 
bility of a given event is to run experiments in which the elements of the sample space are 
equiprobable. Needless to say, it is not always very easy to contrive such experiments. We 
illustrate this issue with another example. 

Example: Suppose one takes a nail from a box containing nails of three different sizes.
It is typically easier to grab a larger nail than a small one and hence such an experiment
would not yield equiprobable outcomes. However, the alternative experiment in which we
first number the nails and then randomly draw a number to decide which nail to take
would lead to equiprobable results.

3.2.4 Back to the basics: Learning how to count

The last example of the previous section illustrates a situation in which it is straightforward 
to redesign the experiment so as to induce equiprobable outcomes. Life is tough, though, 
and such an instance is the exception rather than the rule. For instance, a very common 
problem in quality control is to infer from a small random sample the probability of observing 
a given number of defective goods within a lot. This is evidently a situation that does not 
automatically lead to equiprobable outcomes given the sequential nature of the experiment. 
To deal with such a situation, we must first learn how to count the possible outcomes using 
some tools of combinatorics. 



Multiplication Consider that an experiment consists of a sequence of two procedures,
say, A and B. Let $n_A$ and $n_B$ denote the number of ways in which one can execute A and B,
respectively. It then follows that there are $n = n_A n_B$ ways of executing such an experiment.
In general, if the experiment consists of a sequence of k procedures, then one may run it in
$n = \prod_{i=1}^{k} n_i$ different ways.

Addition Suppose now that the experiment involves k procedures in parallel (rather
than in sequence). This means that we either execute the procedure 1 or the procedure 2
or ... or the procedure k. If $n_i$ denotes the number of ways that one may carry out the
procedure $i \in \{1, \ldots, k\}$, then there are $n = n_1 + \cdots + n_k = \sum_{i=1}^{k} n_i$ ways of running such
an experiment.

Permutation Suppose now that we have a set of n different elements and we wish to
know the number of sequences we can construct containing each element once, and only once.
Note that the concept of sequence is distinct from that of a set, in that order of
appearance matters. For instance, the set {a, b, c} allows for the following permutations
(abc, acb, bac, bca, cab, cba). In general, there are $n! = \prod_{j=0}^{n-1} (n - j)$ possible permutations
out of n elements because there are n options for the first element of the sequence, but only
n - 1 options for the second element, n - 2 options for the third element and so on until we
have only one remaining option for the last element of the sequence. There is also a more
general meaning for permutation in combinatorics for which we form sequences of k different
elements from a set of n elements. This means that we have n options for the first element
of the sequence, but then n - 1 options for the second element and so on until we have only
n - k + 1 options for the last element of the sequence. It thus follows that we have $n!/(n-k)!$
permutations of k out of n elements in this broader sense.
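
The two notions of permutation can be illustrated by brute-force enumeration and compared
with the counting formulas. This is a sketch with hypothetical variable names; `math.perm`
requires Python 3.8 or later.

```python
import math
from itertools import permutations

# All orderings of the three elements a, b, c.
full = ["".join(p) for p in permutations("abc")]

# Sequences of k = 2 different elements out of n = 4.
partial = list(permutations("abcd", 2))

print(full)
print(len(full), math.factorial(3))    # both count n! orderings
print(len(partial), math.perm(4, 2))   # both count n!/(n-k)! sequences
```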

Combination This is a notion that only differs from permutation in that ordering does
not matter. This means that we just wish to know how many subsets of k elements we can
construct out of a set of n elements. For instance, it is possible to form the following subsets
with two elements of {a, b, c, d}: {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, and {c, d}. Note that
{b, a} does not count because it is exactly the same subset as {a, b}. This suggests that, in
general, the number of combinations is smaller than the number of permutations because one
must count only one of the sequences that employ the same elements but with a different
ordering. In view that there are $n!/(n-k)!$ permutations of k out of n elements and $k!$ ways
to choose the ordering of these k elements, the number of possible combinations of k out of
n elements is

$$\binom{n}{k} = \frac{n!}{(n-k)!\,k!}.$$

Before we revisit the original quality control example, it is convenient to illustrate the 
use of the above combinatoric tools through another example. 

Example: Suppose there is a syndicate with 5 engineers and 3 economists. How many
committees of 3 people can one form with exactly 2 engineers? Well, we must form
committees of 2 engineers and 1 economist. There are $\binom{5}{2}$ ways of choosing 2 out of 5 engineers,
whereas there are $\binom{3}{1}$ ways of choosing 1 out of 3 economists. Altogether, this means that
one can form $\binom{5}{2}\binom{3}{1} = 30$ committees with 2 engineers and 1 economist out of a group of 5
engineers and 3 economists.
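
The committee count can be confirmed both with the combination formula and by listing every
committee explicitly. The member labels (E1, ..., E5 and C1, ..., C3) are hypothetical.

```python
import math
from itertools import combinations

engineers = ["E1", "E2", "E3", "E4", "E5"]
economists = ["C1", "C2", "C3"]

# Multiplication rule: choose 2 engineers, then 1 economist.
by_formula = math.comb(5, 2) * math.comb(3, 1)

# Explicit enumeration of every committee with exactly 2 engineers.
committees = [pair + single
              for pair in combinations(engineers, 2)
              for single in combinations(economists, 1)]

print(by_formula, len(committees))
```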

We are now ready to reconsider the quality control problem of inferring the number of
defective goods within a lot. Suppose, for instance, that a lot has n objects of which $n_d$ are
defective and that we draw a sample of k elements of which $k_d$ are defective. We first note
that there are $\binom{n}{k}$ ways of choosing k elements from a lot of n goods, whereas there are $\binom{n_d}{k_d}$
ways of combining $k_d$ defective goods from a total of $n_d$ defective goods within the lot as
well as $\binom{n-n_d}{k-k_d}$ ways of choosing $(k-k_d)$ elements out of the $(n-n_d)$ non-defective goods
within the lot. Accordingly, the probability of observing $k_d$ defective goods within a sample
of k goods is

$$\frac{\binom{n_d}{k_d}\binom{n-n_d}{k-k_d}}{\binom{n}{k}} \tag{3.1}$$

if there are $n_d$ defective goods within a lot of n objects.
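
As a rough illustration of the quality control formula, the sketch below evaluates the probability of each possible number of defective goods in a sample. The lot composition (20 goods, 5 defective, sample of 4) is a hypothetical choice, and `prob_defectives` is a name made up for this example.

```python
from fractions import Fraction
from math import comb

def prob_defectives(n: int, n_d: int, k: int, k_d: int) -> Fraction:
    """Probability of k_d defectives in a sample of k drawn without
    replacement from a lot of n goods containing n_d defectives."""
    return Fraction(comb(n_d, k_d) * comb(n - n_d, k - k_d), comb(n, k))

# Hypothetical lot: 20 goods, 5 of them defective, sample of 4.
lot, defective, sample = 20, 5, 4
dist = [prob_defectives(lot, defective, sample, j) for j in range(sample + 1)]
print(dist, sum(dist))
```

Working with exact fractions makes it easy to confirm that the probabilities over all feasible values of the number of defectives add up to one.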

3.2.5 Conditional probability 

We denote by $\Pr(A \mid B)$ the probability of event A given that we have already observed event
B. Intuitively, conditioning on the realization of a given event has the effect of reducing the
sample space from S to the sample space spanned by B.

Examples 

(1) Suppose that we throw a die twice. In the first throw, we observe a value equal to 6
and we wish to know the probability of observing a value of 2 in the second throw.
In this instance, the fact that we have observed a value of 6 in the first throw has no impact
on the value we will observe in the second throw, for the two events are independent. This
means that the first value conveys no information about the second throw and hence
the probability of observing a value of 2 in the second throw given that we have observed
a value of 6 in the first throw remains the same as the unconditional probability of
observing a value of 2, namely 1/6.

(2) Next, consider $A = \{(x_1,x_2) \mid x_1 + x_2 = 10\} = \{(5,5), (6,4), (4,6)\}$ and
$B = \{(x_1,x_2) \mid x_1 > x_2\} = \{(2,1), (3,2), (3,1), \ldots, (6,5)\}$. The probability of A is $\Pr(A) = 3/36 = 1/12$,
whereas the probability of B is $\Pr(B) = 15/36$. In addition, the probability
of observing both A and B is $\Pr(A \cap B) = 1/36$. It thus turns out that the probability
of observing A given B is $\Pr(A \mid B) = 1/15 = \Pr(A \cap B)/\Pr(B)$, whereas the probability of
observing B given A is $\Pr(B \mid A) = 1/3 = \Pr(A \cap B)/\Pr(A)$.

It is obviously not by chance that, in general, $\Pr(A \mid B) = \Pr(A \cap B)/\Pr(B)$, for $\Pr(B) > 0$.
By conditioning on event B we are restricting the sample space to B and hence we must
consider the probability of observing both A and B and then normalize by the measure of
event B. It is as if we were computing the relative frequency at which the event A occurs
given the outcomes that are possible within event B. This notion makes sense even if we
consider unconditional events. Indeed, the unconditional probability of A is the conditional
probability of A given the sample space S, i.e., $\Pr(A \mid S) = \Pr(A \cap S)/\Pr(S) = \Pr(A)$.
Finally, it is also interesting to note that we may decompose the probability of $A \cap B$ into a
conditional probability and a marginal probability, namely, $\Pr(A \cap B) = \Pr(A \mid B)\Pr(B) =
\Pr(B \mid A)\Pr(A)$.
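
The two-dice example (2) above lends itself to a brute-force check by enumerating the 36 equiprobable outcomes. The sketch below is only illustrative; the helper names `prob`, `in_a` and `in_b` are assumptions of this example rather than anything defined in the text.

```python
from fractions import Fraction
from itertools import product

# Sample space of two die throws: 36 equiprobable ordered pairs.
space = list(product(range(1, 7), repeat=2))

def prob(event) -> Fraction:
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for s in space if event(s)), len(space))

def in_a(s) -> bool:
    """Event A: the two values add up to 10."""
    return s[0] + s[1] == 10

def in_b(s) -> bool:
    """Event B: the first value exceeds the second."""
    return s[0] > s[1]

p_a, p_b = prob(in_a), prob(in_b)
p_ab = prob(lambda s: in_a(s) and in_b(s))
p_a_given_b = p_ab / p_b  # Pr(A | B) = Pr(A and B) / Pr(B)
p_b_given_a = p_ab / p_a  # Pr(B | A) = Pr(A and B) / Pr(A)
print(p_a, p_b, p_ab, p_a_given_b, p_b_given_a)
```

Both factorizations of the joint probability, through $\Pr(A \mid B)\Pr(B)$ and through $\Pr(B \mid A)\Pr(A)$, return the same value, as the decomposition requires.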

Example: Suppose that a computer lab has 4 new and 2 old desktops running Windows
as well as 3 new and 1 old desktops running Linux. What is the probability that a student
randomly sits in front of a desktop running Windows? What is the likelihood that this
particular desktop is new given that it runs Windows? Well, there are 10 computers in the
lab of which 6 run Windows. This means that the answer to the first question is 3/5, whereas
$\Pr(\text{new} \mid \text{Windows}) = \Pr(\text{new} \cap \text{Windows})/\Pr(\text{Windows}) = (4/10)/(6/10) = 2/3$.

Figure 3.2 illustrates a situation in which the events A and B are mutually exclusive and
hence $A \cap B = \emptyset$. In this instance, the probability of both events occurring is obviously zero
and so are both conditional probabilities, i.e., $\Pr(A \cap B) = 0 \Rightarrow \Pr(A \mid B) = \Pr(B \mid A) = 0$. In
contrast, Figure 3.3 depicts another polar case: $A \subset B$. Now, $\Pr(A \cap B) = \Pr(A)$, whereas
$\Pr(A \mid B) = \Pr(A)/\Pr(B)$ and $\Pr(B \mid A) = 1$.

Decomposing a joint probability into the product of a conditional probability and of a
marginal probability is a very useful tool, especially if one combines it with partitions of the
sample space. Let $B_1, \ldots, B_k$ denote a partition of the sample space S, that is to say,

(a) $B_i \cap B_j = \emptyset$, $1 \le i \ne j \le k$

(b) $\bigcup_{i=1}^{k} B_i = S$

(c) $\Pr(B_i) > 0$, $1 \le i \le k$.

This partition yields the decomposition $A = (A \cap B_1) \cup (A \cap B_2) \cup \ldots \cup (A \cap B_k)$ for any
event $A \subseteq S$. The nice thing about partitions is that their elements are mutually exclusive and hence
$(A \cap B_i) \cap (A \cap B_j) = \emptyset$ for any $1 \le i \ne j \le k$. This means that

$$\Pr(A) = \sum_{i=1}^{k} \Pr(A \cap B_i) = \sum_{i=1}^{k} \Pr(A \mid B_i)\Pr(B_i).$$




Figure 3.2: Venn diagram representing two mutually exclusive events A (oval in blue) and
B (oval in red) within the sample space (rectangle box).

Figure 3.3: Venn diagram representing events A (oval in purple) and B (oval in red and
purple) within the sample space (rectangle box) such that $A \cap B = A$.

For instance, if we define the sample space by the possible outcomes of a die throw, we 
may think of several distinct partitions as, for example, 

(a) $B_i = \{i\}$ for $i = 1, \ldots, 6$

(b) $B_1 = \{1,3,5\}$, $B_2 = \{2,4,6\}$

(c) $B_1 = \{1,2\}$, $B_2 = \{3,4,5\}$, $B_3 = \{6\}$.



Example: Consider a lot of 100 frying pans of which 20 are defective. Define the events A =
{first frying pan is defective} and B = {second frying pan is defective} within a context of
sequential sampling without replacement. The probability of observing event B naturally
depends on whether the first frying pan is defective or not. Now, there are only two possible
outcomes in that the first frying pan is either defective or not. This suggests a very simple
partition of the sample space based on A and its complement $\bar{A}$, giving way to $\Pr(B) = \Pr(B \mid A)\Pr(A) +
\Pr(B \mid \bar{A})\Pr(\bar{A})$. In particular, $\Pr(B \mid A) = 19/99$ for there are only 19 defective frying pans
left among the remaining 99 frying pans if A is true. Similarly, $\Pr(B \mid \bar{A}) = 20/99$, whereas
$\Pr(A) = 1/5$ and $\Pr(\bar{A}) = 1 - \Pr(A) = 4/5$. We thus conclude that

$$\Pr(B) = \frac{19}{99} \cdot \frac{1}{5} + \frac{20}{99} \cdot \frac{4}{5} = \frac{1}{5}.$$
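
The frying pan example is a direct application of the decomposition over a partition, and it can be reproduced with exact arithmetic. The variable names in the sketch below are illustrative assumptions; only the numbers come from the example.

```python
from fractions import Fraction

lot_size, defective = 100, 20

# Partition on the first draw: A (defective) and its complement.
p_a = Fraction(defective, lot_size)
p_not_a = 1 - p_a

# Second draw without replacement, conditional on the first draw.
p_b_given_a = Fraction(defective - 1, lot_size - 1)
p_b_given_not_a = Fraction(defective, lot_size - 1)

# Total probability over the partition {A, not A}.
p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
print(p_b)
```

The unconditional probability that the second frying pan is defective coincides with that of the first one, even though the two draws are not independent.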






In some instances, we cannot observe some events, and hence we must infer whether 
they are true or false given the available information. For instance, if you are in a building 
with no windows and someone arrives completely soaked with a broken umbrella, it sounds 
reasonable to infer that it is raining outside even if you cannot directly observe the weather. 
The Bayes rule formalizes how one should conduct such an inference based on conditional
probabilities:

$$\Pr(B_i \mid A) = \frac{\Pr(A \mid B_i)\Pr(B_i)}{\sum_{j=1}^{k} \Pr(A \mid B_j)\Pr(B_j)}, \qquad i = 1, \ldots, k,$$

where $B_1, \ldots, B_k$ is a partition of the sample space. In the example above, we cannot
observe whether it is raining, but we may partition the sample space (i.e., weather) into
$B$ = {it is raining} and $\bar{B}$ = {it is not raining}, and then calculate the probability of B given
that we observe event A = {someone arrives completely soaked with a broken umbrella}.

The Bayes rule has innumerable applications in business, economics and finance. For
instance, imagine you are the market maker for a given stock and that there are both
informed and uninformed traders in the market. In contrast to informed traders, you do not
know whether the news is good or bad and hence you must infer it from the trades you observe
in order to adjust your bid and ask quotes accordingly. If you observe many more traders
buying than selling, then you will assign a higher probability to good news. If traders are
selling much more than buying, then the likelihood of bad news rises. The Bayes rule is the
mechanism by which you learn whether the news is good or bad by looking at trades.

3.2.6 Independent events 

Consider for a moment two mutually exclusive events A and B. Knowing about A gives
loads of information about the likelihood of event B. In particular, if A occurs, we know for
sure that event B did not occur. More formally, the conditional probability of B given that
we observe A is $\Pr(B \mid A) = \Pr(A \cap B)/\Pr(A) = 0$ given that $A \cap B = \emptyset$ (see Figure 3.2).
We thus conclude that A and B are dependent events given that knowing about one entails
complete information about the other. Following this reasoning, it makes sense to associate
independence with lack of information content. We thus say that A and B are independent
events if and only if $\Pr(A \mid B) = \Pr(A)$. The latter condition means that $\Pr(A \cap B) =
\Pr(A \mid B)\Pr(B) = \Pr(A)\Pr(B)$, which in turn is equivalent to saying that $\Pr(B \mid A) = \Pr(B)$
given that $\Pr(A \cap B) = \Pr(B \mid A)\Pr(A)$ as well. Intuitively, if A and B are independent, the
probability of observing A (or B) does not depend on whether B (or A) has occurred and
hence conditioning on the sample space (i.e., looking at the unconditional distribution) or
on the event B makes no difference.

Example: Consider a lot of 10,000 pipes of which 10% come with some sort of indentation.
Suppose we randomly draw two pipes from the lot and define the events $A_1$ =
{first pipe is in perfect condition} and $A_2$ = {second pipe is in perfect condition}. If sampling
is with replacement, then events $A_1$ and $A_2$ are independent and so $\Pr(A_1 \cap A_2) =
\Pr(A_1)\Pr(A_2) = (0.9)^2 = 0.81$. However, if sampling is without replacement, then $\Pr(A_1 \cap
A_2) = \Pr(A_2 \mid A_1)\Pr(A_1) = 0.9 \times \frac{8999}{9999} \approx 0.80999$, which is very marginally different from 0.81.

This example illustrates well a situation in which the events are not entirely independent,
though assuming independence would greatly simplify the computation of the joint probability
at the expense of a very marginal error given the large size of the lot. This is just to say that
sometimes it pays off to assume independence between events even if we know that, in
theory, they are not strictly independent.
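
To get a feeling for how small the cost of assuming independence is in the pipe example, one may compare the two joint probabilities directly. This is a minimal sketch with illustrative variable names; the lot size and the share of perfect pipes are those of the example.

```python
from fractions import Fraction

lot_size = 10_000
perfect = 9_000  # 90% of the pipes have no indentation

# With replacement the two draws are independent.
p_with = Fraction(perfect, lot_size) ** 2

# Without replacement: Pr(A1) * Pr(A2 | A1).
p_without = Fraction(perfect, lot_size) * Fraction(perfect - 1, lot_size - 1)

print(float(p_with), float(p_without), float(p_with - p_without))
```

The gap between the two figures shrinks as the lot grows relative to the number of draws, which is why the independence approximation is harmless here.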

Problem set 

Exercise 1. Show that

$$\Pr(A \cup B \cup C) = \Pr(A) + \Pr(B) + \Pr(C) - \Pr(A \cap B) - \Pr(A \cap C) - \Pr(B \cap C) + \Pr(A \cap B \cap C).$$

Solution: We employ a similar decomposition to the one in the proof of (c). In particular,
$(A \cup B) \cup C = (A \cup B) \cup (C \cap \overline{A \cup B})$. As the intersection of the two events on the right-hand side is null,

$$\Pr(A \cup B \cup C) = \Pr(A \cup B) + \Pr(C \cap \overline{A \cup B}).$$

We now decompose the event C into outcomes that belong and do not belong to $A \cup B$:

$$C = \big(C \cap (A \cup B)\big) \cup \big(C \cap \overline{A \cup B}\big),$$

yielding $\Pr(C \cap \overline{A \cup B}) = \Pr(C) - \Pr\big(C \cap (A \cup B)\big)$. So far, we have that

$$\begin{aligned}
\Pr(A \cup B \cup C) &= \Pr(A \cup B) + \Pr(C) - \Pr\big(C \cap (A \cup B)\big) \\
&= \Pr(A) + \Pr(B) + \Pr(C) - \Pr(A \cap B) - \Pr\big(C \cap (A \cup B)\big).
\end{aligned}$$

It remains to show that the last term equals $\Pr(A \cap C) + \Pr(B \cap C) - \Pr(A \cap B \cap C)$.
To appreciate this, it suffices to see that $C \cap (A \cup B) = (A \cap C) \cup (B \cap C)$, which gives way
to $\Pr\big(C \cap (A \cup B)\big) = \Pr(A \cap C) + \Pr(B \cap C) - \Pr\big((A \cap C) \cap (B \cap C)\big)$ by P3. The last
term is obviously equivalent to $\Pr(A \cap B \cap C)$, completing the proof. ■

Exercise 2. Consider two events A and B. Show that the probability that only one of these
events occurs is $\Pr(A \cup B) - \Pr(A \cap B)$.

Solution: Let C denote the event in which we observe only one event between A and
B. It then consists of every possible outcome that is in $A \cup B$ and not in $A \cap B$. It
is straightforward to appreciate from a Venn diagram that $C = (A \cup B) - (A \cap B) =
(A \cup B) \cap \overline{A \cap B} = (A \cap \bar{B}) \cup (\bar{A} \cap B)$. The last representation is the easiest to manipulate
for it involves mutually exclusive events. In particular, it follows immediately from (c) that
$\Pr(C) = \Pr(A) + \Pr(B) - 2\Pr(A \cap B) = \Pr(A \cup B) - \Pr(A \cap B)$. ■

Exercise 3. There are three plants that produce a given screw: A, B, and C. Plant
A produces twice as many screws as B and as C, whose productions are at par. In addition,
quality control is better at plants A and B in that only 2% of the screws they produce
are defective as opposed to 4% in plant C. Suppose that we sample one screw from the
warehouse that collects all screws produced by A, B, and C. What is the probability that
the screw is defective? What is the probability that a defective screw is from plant A?

Solution: Let A = {screw comes from plant A}, B = {screw comes from plant B},
C = {screw comes from plant C}, and D = {screw is defective}. Given that A's production
is twice that of each of the other plants, it follows that $\Pr(A) = 1/2$ and that $\Pr(B) = \Pr(C) = 1/4$. We now decompose
the event D according to whether the screw comes from A, B, or C. The latter forms
a partition because if a screw comes from a given plant it cannot come from any other
plant. In addition, there are only plants A, B, and C producing this particular screw. The
decomposition yields

$$\begin{aligned}
\Pr(D) &= \Pr(D \mid A)\Pr(A) + \Pr(D \mid B)\Pr(B) + \Pr(D \mid C)\Pr(C) \\
&= 0.02 \cdot \tfrac{1}{2} + 0.02 \cdot \tfrac{1}{4} + 0.04 \cdot \tfrac{1}{4} = 0.025.
\end{aligned}$$

To answer the second question, we must apply the Bayes rule for we do not observe whether
the screw comes from a given plant, but we do know whether it is defective or not. So,
the conditional probability that the screw is from A given that it is defective is $\Pr(A \mid D) =
\Pr(D \mid A)\Pr(A)/\Pr(D) = 0.01/0.025 = 2/5$. ■
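
The two steps of the screw exercise, namely the decomposition over the partition and the Bayes rule, can be sketched in a few lines. The dictionaries `prior` and `defect_rate` are names chosen for this illustration; their entries are the figures given in the exercise.

```python
from fractions import Fraction

# Production shares and plant-specific defect rates (from the exercise).
prior = {"A": Fraction(1, 2), "B": Fraction(1, 4), "C": Fraction(1, 4)}
defect_rate = {"A": Fraction(2, 100), "B": Fraction(2, 100), "C": Fraction(4, 100)}

# Total probability: Pr(D) = sum over plants of Pr(D | plant) Pr(plant).
p_defective = sum(defect_rate[p] * prior[p] for p in prior)

# Bayes rule: Pr(plant | D) = Pr(D | plant) Pr(plant) / Pr(D).
posterior = {p: defect_rate[p] * prior[p] / p_defective for p in prior}

print(p_defective, posterior)
```

Computing the posterior for all three plants at once also confirms that the conditional probabilities given D add up to one.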
Chapter 4

Probability distributions 

4.1 Random variable 

Dealing with events and sample spaces is very intuitive, but it is not very easy to keep track 
of things if the sample space is large. That is why we next introduce the notion of random 
variable, which entails a much easier approach to probability theory. 

Definition: X(s) is a random variable if $X(\cdot)$ is a function that assigns a real value to
every element s in the sample space S.

Example: Suppose we flip a coin twice and define the sample space as the sequence of
heads and tails, that is to say, S = {HH, HT, TH, TT}. Let X denote a random variable
equal to the number of heads:

$$X(s) = \begin{cases} 0, & \text{if } s \in \{TT\} \\ 1, & \text{if } s \in \{HT, TH\} \\ 2, & \text{if } s \in \{HH\}. \end{cases}$$

Note that every element in the sample space corresponds to exactly one value of the random
variable, though the latter may assume the same value for different elements of the sample
space.

4.1.1 Discrete random variable 

If X is a discrete random variable, then it takes only a countable number of values. This
means that, in practice, we may consider a list of possible outcomes $x_1, \ldots, x_n$ (even if $n \to \infty$)
for any discrete random variable X. Denoting the probability of observing a particular
value by $p_i = p(x_i) = \Pr(X = x_i)$, it follows that $p_i \ge 0$ for $i = 1, \ldots, n$ and that $\sum_{i=1}^{n} p_i = 1$.
The function $p(\cdot)$ is known as the probability distribution function of the discrete random
variable. For instance, a random variable following a discrete uniform probability function
$p_i = 1/n$, with n finite, is the random-variable counterpart of equiprobable events. Figure
4.1 displays the probability distribution function of a discrete uniform random variable over
the set {1, 2, ..., 10}.

Example: Suppose that a mutual fund buys and holds a given stock as long as price
changes are nonnegative. Let stock prices follow a random walk such that the probability
of observing a negative price change is 2/5. Define the sample space S and the random
variable N according to the number of periods that are necessary to observe the mutual
fund unwinding its position: S = {1, 01, 001, 0001, ...} and $N \in \{1, 2, 3, 4, \ldots\}$. It is easy
to see that $N = n$ if and only if we observe a negative price change in the nth period after
$(n-1)$ periods of nonnegative returns. In addition, the random walk hypothesis implies that
returns are independent over time, and so

$$\Pr(N = n) = \left(\frac{3}{5}\right)^{n-1} \frac{2}{5}, \qquad n = 1, 2, \ldots$$

Just as a sanity check, let's test whether the above probability function sums up to one if
we consider every possible outcome:

$$\sum_{n=1}^{\infty} \Pr(N = n) = \frac{2}{5}\left(1 + \frac{3}{5} + \frac{9}{25} + \ldots\right) = \frac{2}{5} \cdot \frac{1}{1 - 3/5} = 1.$$

Figure 4.1: The left and right axes correspond to the probability distribution function and
cumulative probability distribution function of a uniform distribution over {1, 2, ..., 10},
respectively.
Binomial distribution

A Bernoulli trial gives rise to the simplest and most intuitive of all probability distribution functions.
It restricts attention to a binary random variable that takes value one with probability p,
otherwise it is equal to zero (with probability $1 - p$). Consider now a random variable that
sums up the values of n independent Bernoulli trials. The probability distribution function
of such a variable is by definition binomial.

Example: Suppose that a production line results in defective products with probability 
0.20. A random draw of three products leads to a sample space given by 

S = {DDD, DDN, DND, NDD, NND, NDN, DNN, NNN}, 

where D and N refer to defective and non-defective products, respectively. The ordering
does not matter much in most situations and so the typical random variable of interest is
the number of defective goods $X \in \{0, 1, 2, 3\}$. The probability distribution function of X
then is $p_0 = 0.8^3$, $p_1 = 3 \times 0.2 \times 0.8^2$, $p_2 = 3 \times 0.8 \times 0.2^2$, and $p_3 = 0.2^3$.

In the example above, it is readily seen from the sample space that there is only one 
manner to obtain either three defective goods or three non-defective goods. In contrast, 
there are three different ways to observe either one or two defective products due to the fact 
that the ordering does not matter. It is precisely the latter that explains why the binomial 
distribution function involves the combinatoric tool of combination. 

Definition: Consider an experiment in which the event A occurs with probability $p =
\Pr(A)$ and so $\Pr(\bar{A}) = 1 - p$. Run such an experiment independently n times. The resulting
sample space is S = {all sequences $a_1, \ldots, a_n$}, where $a_i$ is either A or $\bar{A}$ for $i = 1, \ldots, n$. The
random variable X that counts the number of times that the event A occurs has a binomial
distribution function B(n, p) with parameters n (namely, the number of independent trials)
and p (namely, the probability of event A). The binomial distribution is such that

$$p_x = \Pr(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x = 0, 1, \ldots, n. \tag{4.1}$$

To show that (4.1) is a probability distribution function, it suffices to confirm that it
sums up to one given that $p_x \ge 0$. As expected,

$$\sum_{x=0}^{n} p_x = \sum_{x=0}^{n} \binom{n}{x} p^x (1-p)^{n-x} = [p + (1-p)]^n = 1,$$

where the last equality comes from Newton's binomial expansion (hence the name of the
distribution).
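
The binomial probabilities in (4.1) are straightforward to evaluate numerically. The sketch below, in which `binomial_pmf` is a hypothetical helper name, reproduces the production-line example with three draws and a defect probability of 0.2.

```python
from math import comb, isclose

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Pr(X = x) for X ~ B(n, p), following equation (4.1)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Three draws with defect probability 0.2 (the production-line example).
pmf = [binomial_pmf(x, 3, 0.2) for x in range(4)]
print(pmf, sum(pmf))
```

The factor `comb(n, x)` accounts for the number of orderings that lead to the same count of defective goods, which is exactly why combinations enter the formula.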

Problem set 

Exercise 1. Paolo Maldini challenges Buffon to a series of 20 penalty kicks. In the first
10 penalty kicks, Maldini scores with probability 4/5. However, as from the 11th attempt,
Maldini's age kicks in and the probability of scoring reduces to 1/2. Assuming that the
outcomes are independent among themselves, compute the probability that Maldini scores
exactly k goals.

Solution: Each penalty kick corresponds to a Bernoulli trial with probability $p_1 = 4/5$
of success in the first 10 attempts and $p_2 = 1/2$ from then on up to the 20th penalty kick.
We thus split the problem into scoring $k_1$ goals in the first 10 attempts and $k - k_1$ goals in
the second 10 penalty kicks. The former leads to a binomial distribution B(10, 4/5), whereas
the latter to a binomial B(10, 1/2). Putting together gives way to

$$\binom{10}{k_1} p_1^{k_1} (1-p_1)^{10-k_1} \times \binom{10}{k-k_1} p_2^{k-k_1} (1-p_2)^{10-k+k_1} = \binom{10}{k_1} 0.8^{k_1}\, 0.2^{10-k_1} \times \binom{10}{k-k_1} 0.5^{10}.$$

It now remains to sum up the ways in which Maldini can score k goals by scoring exactly
$k_1$ goals in the first 10 attempts. To this end, we must first consider whether $k > 10$ or not,
yielding a probability of scoring k penalty kicks of

$$\sum_{k_1 = \max\{0,\, k-10\}}^{\min\{k,\, 10\}} \binom{10}{k_1} 0.8^{k_1}\, 0.2^{10-k_1} \binom{10}{k-k_1} 0.5^{10}.$$

We sum from $\max\{0, k-10\}$ because if $k > 10$ then Maldini must score at least $k - 10$
goals in each series of 10 attempts. Similarly, we sum up to $\min\{k, 10\}$ because if $k < 10$, then
Maldini cannot score more than k goals in each series of penalty kicks. ■
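
The summation in the penalty kick exercise is easy to get wrong by hand, so a numerical cross-check may help. The following sketch assumes the setup of the exercise (10 kicks with scoring probability 0.8 followed by 10 kicks with probability 0.5); the function name `prob_goals` is made up for the illustration.

```python
from math import comb, isclose

def prob_goals(k: int) -> float:
    """Pr(exactly k goals in 20 kicks): B(10, 0.8) followed by B(10, 0.5)."""
    total = 0.0
    for k1 in range(max(0, k - 10), min(k, 10) + 1):
        first = comb(10, k1) * 0.8**k1 * 0.2 ** (10 - k1)
        second = comb(10, k - k1) * 0.5**10
        total += first * second
    return total

dist = [prob_goals(k) for k in range(21)]
expected_goals = sum(k * pk for k, pk in enumerate(dist))
print(sum(dist), expected_goals)
```

Two quick plausibility checks are that the probabilities over k = 0, ..., 20 add up to one and that the expected number of goals equals 10 × 0.8 + 10 × 0.5 = 13.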

Exercise 2. Consider a random variable $X \in \{0, 1, 2, \ldots\}$ such that

$$\Pr(X = t) = (1-a)\,a^t, \qquad t = 0, 1, 2, \ldots$$

(a) For which values of a is the above expression indeed a probability distribution function?

(b) Show that for any positive integers s and t, $\Pr(X \ge s+t \mid X \ge s) = \Pr(X \ge t)$.

Solution:

(a) It follows from $\sum_{t=0}^{\infty} \Pr(X = t) = 1$ that $(1-a)\sum_{t=0}^{\infty} a^t = 1$. The latter involves an
infinite sum of a geometric progression, which converges to $(1-a)\sum_{t=0}^{\infty} a^t = (1-a)\frac{1}{1-a} = 1$
only if a belongs to the unit interval, i.e., $0 \le a < 1$.

(b) It follows from the fact that $s + t \ge s \ge 0$ that

$$\Pr(X \ge s+t \mid X \ge s) = \frac{\Pr(X \ge s+t)}{\Pr(X \ge s)} = \frac{\sum_{r=s+t}^{\infty} (1-a)\,a^r}{\sum_{r=s}^{\infty} (1-a)\,a^r} = \frac{a^{s+t}(1-a)^{-1}}{a^{s}(1-a)^{-1}} = a^t.$$

To complete the proof, it suffices to show that $\Pr(X \ge t) = a^t$. That is indeed the case
because

$$\Pr(X \ge t) = \sum_{r=t}^{\infty} (1-a)\,a^r = (1-a)\sum_{r=t}^{\infty} a^r = (1-a)\frac{a^t}{1-a} = a^t,$$

where the penultimate equality comes from the fact that the infinite sum of a geometric
progression is equal to the first term of the progression divided by one minus the quotient
of the progression. ■
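
The memoryless property in part (b) can also be checked numerically. The sketch below reads the inequalities as weak, that is, it works with the tail Pr(X ≥ t), since this is the tail whose geometric sum starts at $a^t$. The values of a, s and t are arbitrary choices for illustration, and `tail` is a hypothetical helper that truncates the infinite sum.

```python
from math import isclose

def tail(t: int, a: float, terms: int = 500) -> float:
    """Pr(X >= t) for Pr(X = r) = (1 - a) a**r, by truncated summation."""
    return sum((1 - a) * a**r for r in range(t, t + terms))

a, s, t = 0.6, 3, 4  # hypothetical values, chosen only for illustration
conditional = tail(s + t, a) / tail(s, a)
print(conditional, tail(t, a), a**t)
```

Up to the truncation error of the sum, the conditional tail, the unconditional tail and $a^t$ should coincide.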




4.1.2 Cumulative probability distribution function 

The goal is not always to compute a pointwise probability. We are very often interested
in understanding how likely it is to observe a range of values, e.g., the probability of observing
$X \le x$. This motivates us to define the cumulative distribution function as

$$F_X(x) = \Pr(X \le x) = \sum_{x_j \le x} p(x_j).$$

Note that $F_X$ is a nondecreasing step function in x given that $F_X(x_1) \le F_X(x_2)$ if $x_1 < x_2$.
In addition, the cumulative distribution function belongs to the unit interval given that
$\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$. Finally, if $X \in \{x_1, x_2, \ldots \mid x_1 < x_2 < \ldots\}$, then
$\Pr(X = x_n) = \Pr(X \le x_n) - \Pr(X \le x_{n-1}) = F_X(x_n) - F_X(x_{n-1})$.

Example: Let $X \in \{x_1, x_2, x_3\}$ with $p(x_1) = 1/3$, $p(x_2) = 1/6$, and $p(x_3) = 1/2$. The cumulative
distribution function then reads

$$F_X(x) = \begin{cases} 0, & \text{if } -\infty < x < x_1 \\ 1/3, & \text{if } x_1 \le x < x_2 \\ 1/2, & \text{if } x_2 \le x < x_3 \\ 1, & \text{if } x_3 \le x < \infty. \end{cases}$$
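
A step-shaped cumulative distribution function is simple to implement from the pointwise probabilities. In the sketch below the support points 1, 2 and 3 are hypothetical placeholders for $x_1$, $x_2$ and $x_3$, and the probabilities are those implied by the steps 1/3, 1/2 and 1 of the cumulative distribution function in the example.

```python
from bisect import bisect_right
from fractions import Fraction
from itertools import accumulate

# Hypothetical support points standing in for x1 < x2 < x3.
support = [1, 2, 3]
pmf = [Fraction(1, 3), Fraction(1, 6), Fraction(1, 2)]
steps = list(accumulate(pmf))  # running sums: F at each support point

def cdf(x) -> Fraction:
    """F_X(x) = Pr(X <= x): sum of p(x_j) over all x_j <= x."""
    idx = bisect_right(support, x)  # number of support points <= x
    return steps[idx - 1] if idx else Fraction(0)

print([cdf(x) for x in (0.5, 1, 1.5, 2, 2.9, 3, 10)])
```

The jump of the function at each support point recovers the corresponding pointwise probability, e.g. `cdf(2) - cdf(1)` returns $p(x_2)$.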
4.1.3 Continuous random variable

We say that a random variable is continuous if the probability of observing any particular
value in the real line is zero. Accordingly, the notion of probability distribution function is
meaningless and we have to come up with something a bit different, though with a similar
interpretation. The analog of the probability distribution function for continuous random
variables is the probability density function. To understand the latter, we first note that,
within the context of continuous random variables, it only makes sense to talk about the
probability of observing a value within a given interval. The probability density function
then measures the mass of probability of an infinitesimal interval, that is to say, $\Pr(x < X <
x + \Delta x)$ for a very small $\Delta x > 0$. In the following definition, we formalize such a notion.

Definition: The probability density function $f_X(\cdot)$ of a random variable X is such that

(a) $f_X(x) \ge 0$, $\forall x \in \mathbb{R}$

(b) $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$

(c) $\Pr(a \le X \le b) = \int_a^b f_X(x)\,dx$ for $-\infty < a < b < +\infty$.

Note that condition (a) corresponds to the restriction that the probability distribution
function is nonnegative for every element in the sample space, whereas (b) is analogous to the
imposition that the probability distribution function sums up to one if evaluated at every
element in the sample space. Finally, (c) reflects the fact that the probability of observing a
particular value is zero given that $\Pr(X = x_0) = \int_{x_0}^{x_0} f_X(x)\,dx = 0$. Of course, the fact that
an event A has probability zero does not mean that it is impossible to observe it. It just
means that it is improbable. For instance, imagine that we are measuring how much time
one takes to run 100 meters. The probability of observing a value of precisely 10 seconds is
zero ex-ante, for there is a continuum of values around 10 seconds, though it could well take
exactly 10 seconds ex-post. A corollary is that

$$\Pr(a \le X \le b) = \Pr(a < X \le b) = \Pr(a \le X < b) = \Pr(a < X < b).$$
Examples

(1) Let a random variable satisfy the following probability density function

$$f_X(x) = \begin{cases} 2x, & \text{if } 0 < x < 1 \\ 0, & \text{otherwise.} \end{cases}$$

We first note that $f_X(x) \ge 0$ for any value of $x \in \mathbb{R}$ and that

$$\int_{-\infty}^{+\infty} f_X(x)\,dx = \int_0^1 f_X(x)\,dx = \int_0^1 2x\,dx = x^2\Big|_0^1 = 1.$$

In addition, if we wish to compute the probability of observing a given interval, say, $X \le 1/2$,
then it follows that

$$\Pr\left(X \le \tfrac{1}{2}\right) = \int_0^{1/2} 2x\,dx = x^2\Big|_0^{1/2} = \frac{1}{4}.$$

Finally, we may also compute the conditional probability of observing a value within an
interval given that we know it belongs to a larger interval. For instance,

$$\Pr\left(X \le \tfrac{1}{2} \,\Big|\, \tfrac{1}{3} \le X \le \tfrac{2}{3}\right) = \frac{\Pr\left(\tfrac{1}{3} \le X \le \tfrac{1}{2}\right)}{\Pr\left(\tfrac{1}{3} \le X \le \tfrac{2}{3}\right)} = \frac{x^2\big|_{1/3}^{1/2}}{x^2\big|_{1/3}^{2/3}} = \frac{5}{12}.$$

(2) Let X denote a random variable with density function

$$f_X(x) = \begin{cases} a x^{-3}, & \text{if } 1 < x < 3 \\ 0, & \text{otherwise.} \end{cases}$$

It is easy to appreciate that $f_X(x) \ge 0$ for all $x \in \mathbb{R}$ as long as $a > 0$. In addition, to ensure
that it integrates up to one over the real line, a must equal 9/4 given that

$$\int_1^3 a x^{-3}\,dx = -\frac{a}{2x^2}\Big|_1^3 = \frac{4a}{9}.$$
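
Both density examples can be verified with exact arithmetic by working with the antiderivatives. The sketch below is only an illustration; the function names `cdf_1` and `mass_2` are assumptions introduced here, not notation from the text.

```python
from fractions import Fraction

def cdf_1(x: Fraction) -> Fraction:
    """Example (1): f(x) = 2x on (0, 1), so F(x) = x**2 on [0, 1]."""
    return min(max(x, Fraction(0)), Fraction(1)) ** 2

third, half, two_thirds = Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)
p_half = cdf_1(half)
p_cond = (cdf_1(half) - cdf_1(third)) / (cdf_1(two_thirds) - cdf_1(third))

def mass_2(a: Fraction) -> Fraction:
    """Example (2): integral of a / x**3 over (1, 3) via -a / (2 x**2)."""
    def antiderivative(x: Fraction) -> Fraction:
        return -a / (2 * x**2)
    return antiderivative(Fraction(3)) - antiderivative(Fraction(1))

print(p_half, p_cond, mass_2(Fraction(9, 4)))
```

The last figure confirms that the choice a = 9/4 makes the density of example (2) integrate to one.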

As before, the interest sometimes lies in calculating the probability of observing X within
a given interval. The cumulative distribution function of a continuous random variable is

$$F_X(x) = \Pr(X \le x) = \int_{-\infty}^{x} f_X(u)\,du.$$

The cumulative distribution function is a nondecreasing continuous function given that
$F_X(x_1) \le F_X(x_2)$ if $x_1 < x_2$. The continuity is in contrast with the step-like feature in
the case of discrete random variables. It is as if the height of the steps shrinks to zero
as the random variable moves from a discrete to a continuous nature, so that $F_X$ becomes
a continuous function. Given that we are now dealing with a continuous random
variable, the sum of the pointwise probabilities becomes the integral of the density function.
As before, $F_X$ is such that $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$. In addition,
$\Pr(x_0 < X < x_1) = \Pr(X < x_1) - \Pr(X < x_0) = F_X(x_1) - F_X(x_0)$ for any
$-\infty < x_0 < x_1 < \infty$. Finally, it also follows from the definition of an integral that

$$f_X(x) = \frac{d}{dx} F_X(x) = F'_X(x)$$

for every $x \in \mathbb{R}$ at which $F_X$ is differentiable.

Examples

(1) Let X denote a random variable with cumulative distribution function

$$F_X(x) = \begin{cases} 1 - e^{-x}, & \text{if } x \ge 0 \\ 0, & \text{if } x < 0. \end{cases}$$

It is evident that $\lim_{x \to -\infty} F_X(x) = 0$ and that $\lim_{x \to \infty} F_X(x) = 1$, whereas differentiating
the cumulative distribution function gives way to

$$f_X(x) = F'_X(x) = \begin{cases} e^{-x}, & \text{if } x \ge 0 \\ 0, & \text{if } x < 0. \end{cases}$$

(2) Let X denote an exponential random variable with density function

$$f_X(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } x \ge 0 \\ 0, & \text{if } x < 0. \end{cases}$$

Note that the previous example is a particular case of the exponential distribution with
$\lambda = 1$. Now, consider the probability of observing X within the interval $(t, t+1)$:

$$\Pr(t < X < t+1) = F_X(t+1) - F_X(t) = \int_t^{t+1} \lambda e^{-\lambda x}\,dx = -e^{-\lambda x}\Big|_t^{t+1} = e^{-\lambda t}\left(1 - e^{-\lambda}\right).$$

Letting $a = e^{-\lambda}$ then yields $\Pr(t < X < t+1) = (1-a)\,a^t$, which is the probability
distribution function of the memoryless discrete random variable of the second exercise of
the problem set in Section 4.1.
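
The link between the exponential distribution and the discrete memoryless distribution can be illustrated numerically. In the sketch below the rate `lam = 0.5` is an arbitrary choice and `exp_cdf` is a hypothetical helper implementing the exponential cumulative distribution function.

```python
from math import exp, isclose

def exp_cdf(x: float, lam: float) -> float:
    """F_X(x) = 1 - exp(-lam * x) for x >= 0, and 0 otherwise."""
    return 1.0 - exp(-lam * x) if x >= 0 else 0.0

lam = 0.5  # hypothetical rate, for illustration only
a = exp(-lam)
intervals = [exp_cdf(t + 1, lam) - exp_cdf(t, lam) for t in range(5)]
geometric = [(1 - a) * a**t for t in range(5)]
print(intervals)
print(geometric)
```

The probability mass that the exponential distribution assigns to each unit interval matches, term by term, the discrete probabilities with $a = e^{-\lambda}$.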
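(3) Let X denote a random variable with density function

$$f_X(x) = \begin{cases} 6x(1-x), & \text{if } 0 < x < 1 \\ 0, & \text{otherwise.} \end{cases}$$

As before, it is easy to see that $f_X(x) \ge 0$ for every $x \in \mathbb{R}$ and that $\int_0^1 6x(1-x)\,dx =
(3x^2 - 2x^3)\big|_0^1 = 3 - 2 = 1$. As for the cumulative distribution function, integrating the
density function up to x yields

$$F_X(x) = \begin{cases} 0, & \text{if } x < 0 \\ x^2(3 - 2x), & \text{if } 0 \le x < 1 \\ 1, & \text{if } x \ge 1. \end{cases}$$

We may employ the latter to compute the probability of observing a value within an interval
as well as a conditional probability as, e.g.,

$$\Pr\left(X \le \tfrac{1}{2} \,\Big|\, \tfrac{1}{3} < X < \tfrac{2}{3}\right) = \frac{(3x^2 - 2x^3)\big|_{1/3}^{1/2}}{(3x^2 - 2x^3)\big|_{1/3}^{2/3}} = \frac{3/4 - 1/4 - 1/3 + 2/27}{4/3 - 16/27 - 1/3 + 2/27} = \frac{13/54}{13/27} = \frac{1}{2}.$$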
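
The conditional probability in example (3) involves several fractions, so it is worth evaluating it exactly. The sketch below does so through the cumulative distribution function $x^2(3 - 2x)$; the function name `cdf` and the variable names are illustrative assumptions.

```python
from fractions import Fraction

def cdf(x: Fraction) -> Fraction:
    """F_X(x) = x**2 (3 - 2x) on [0, 1) for the density 6x(1 - x)."""
    if x < 0:
        return Fraction(0)
    if x >= 1:
        return Fraction(1)
    return x**2 * (3 - 2 * x)

third, half, two_thirds = Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)
numerator = cdf(half) - cdf(third)          # Pr(1/3 < X <= 1/2)
denominator = cdf(two_thirds) - cdf(third)  # Pr(1/3 < X < 2/3)
print(numerator, denominator, numerator / denominator)
```

Since the density 6x(1 − x) is symmetric around 1/2 and the conditioning interval is centered at 1/2, the conditional probability of falling in its lower half is one half.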

Uniform distribution 

The simplest continuous distribution function is the uniform. It essentially dictates that
intervals of the same length are equiprobable within the support of the random variable, say
$[\alpha, \beta]$. The corresponding density function is

$$f_X(x) = \begin{cases} 1/(\beta - \alpha), & \text{if } \alpha \le x \le \beta \\ 0, & \text{otherwise,} \end{cases}$$

which implies that $\Pr(a \le X \le b) = \frac{b-a}{\beta-\alpha} = F_X(b) - F_X(a)$ with $-\infty < \alpha \le a < b \le \beta < \infty$.
Finally, the cumulative distribution function reads

$$F_X(x) = \begin{cases} 0, & \text{if } -\infty < x < \alpha \\ (x - \alpha)/(\beta - \alpha), & \text{if } \alpha \le x < \beta \\ 1, & \text{if } \beta \le x < \infty. \end{cases}$$

Example: Let X denote a random variable that is uniformly distributed in the unit
interval. The density function then is

$$f_X(x) = \begin{cases} 1, & \text{if } 0 \le x \le 1 \\ 0, & \text{otherwise,} \end{cases}$$

whereas the cumulative distribution function is

$$F_X(x) = \begin{cases} 0, & \text{if } -\infty < x < 0 \\ x, & \text{if } 0 \le x < 1 \\ 1, & \text{if } 1 \le x < \infty. \end{cases}$$

There are several applications for the uniform distribution. For instance, if we wish
to generate a random variable from a particular distribution, we may always start with a
uniform distribution and then transform it to obtain the desired distribution. This is possible
because, for any continuous random variable X with cumulative distribution function $F_X$, $F_X(X)$ has
a uniform distribution in the unit interval. More advanced applications include resampling
techniques (e.g., bootstrap) and prior distributions in Bayesian analysis.
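
The idea of generating a random variable by transforming a uniform draw can be sketched as follows for the exponential distribution, whose cumulative distribution function is easy to invert. The seed, the sample size and the function name `exponential_from_uniform` are arbitrary choices made for this illustration.

```python
import random
from math import log

def exponential_from_uniform(u: float, lam: float = 1.0) -> float:
    """Invert F(x) = 1 - exp(-lam * x): x = -ln(1 - u) / lam."""
    return -log(1.0 - u) / lam

rng = random.Random(12345)  # fixed seed so that the run is reproducible
draws = [exponential_from_uniform(rng.random()) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
share_below_one = sum(d <= 1.0 for d in draws) / len(draws)
print(sample_mean, share_below_one)
```

With a standard exponential target, the sample mean should be close to one and the share of draws below one close to $1 - e^{-1}$, up to sampling error.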

4.1.4 Functions of random variables 

Consider a random variable $X : s \mapsto X(s) = x$ for any $s \in S$ and a transformation
$H : x \mapsto H(x) = y$ that maps a realization x of the random variable X into a real value
$y \in \mathbb{R}$. Instead of transforming the realization x, one may also transform the random
variable X, giving way to another random variable $Y = H(X) : s \mapsto H[X(s)]$. Given
that the randomness of Y is completely due to X, it is possible to compute the probability
distribution of Y if we know the probability distribution function of X.

Example: Let $Y = H(X) = 2X + 1$ with X following a standard exponential distribution
$f_X(x) = e^{-x}\,1(x > 0)$, where $1(A)$ is an indicator function that takes value one if A is true,
zero otherwise. Given that X is positive, the support of Y is given by the interval $[1, \infty)$. It
then follows that

$$F_Y(y) = \Pr(Y \le y) = \Pr(2X + 1 \le y) = \Pr\left(X \le \frac{y-1}{2}\right) = \int_0^{(y-1)/2} e^{-x}\,dx = \left(-e^{-x}\right)\Big|_0^{(y-1)/2} = 1 - e^{(1-y)/2}$$

for $y \ge 1$.
In general, we deal with transformations of a random variable according to whether the
latter is discrete or continuous. If X is a discrete random variable, then it is easy to appreciate
that a random variable $Y = H(X)$ will also be discrete regardless of the transformation H.
Letting $X \in \{x_1, \ldots, x_n, \ldots\}$ yields $Y \in \{y_1 = H(x_1), \ldots, y_n = H(x_n), \ldots\}$. In addition,
$\Pr(Y = y_i) = \Pr(X = x_i)$ if the transformation H is such that each y corresponds to a
unique value x.

Example: Let $X \in \{-1, 0, 1\}$ with $\Pr(X = -1) = 1/3$, $\Pr(X = 0) = 1/2$, and $\Pr(X =
1) = 1/6$. If $Y = X^2$, then $\Pr(Y = 0) = 1/2$ and $\Pr(Y = 1) = 1/2$.

Letting $x_{i_k}$ denote the values of X such that $H(x_{i_k}) = y_i$ for every $k \in \{1, 2, \ldots\}$ leads
to $\Pr(Y = y_i) = \sum_{k=1}^{\infty} \Pr(X = x_{i_k})$.

Example: Let $X \in \{1, 2, \ldots, n, \ldots\}$ with $\Pr(X = n) = 2^{-n}$ and let

$$Y = \begin{cases} 1, & \text{if } X \text{ is even} \\ -1, & \text{if } X \text{ is odd.} \end{cases}$$

It thus follows that $\Pr(Y = 1) = \frac{1}{4} + \frac{1}{16} + \ldots = \frac{1/4}{1 - 1/4} = \frac{1}{3}$ and hence $\Pr(Y = -1) = \frac{2}{3}$.
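
For discrete random variables, the distribution of Y = H(X) follows from adding up the probabilities of all values of X that map into the same value of Y. The sketch below illustrates this with the two examples above; the function name `pushforward` is an assumption of this illustration, and the parity example is truncated at n = 60, which is only an approximation of the infinite support.

```python
from collections import defaultdict
from fractions import Fraction

def pushforward(pmf: dict, h) -> dict:
    """Distribution of Y = h(X): add up Pr(X = x) over all x with h(x) = y."""
    out = defaultdict(Fraction)
    for x, p in pmf.items():
        out[h(x)] += p
    return dict(out)

pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 2), 1: Fraction(1, 6)}
pmf_y = pushforward(pmf_x, lambda x: x**2)
print(pmf_y)

# Parity example with Pr(X = n) = 2**(-n), truncated at n = 60.
pmf_n = {n: Fraction(1, 2**n) for n in range(1, 61)}
parity = pushforward(pmf_n, lambda n: 1 if n % 2 == 0 else -1)
print(float(parity[1]), float(parity[-1]))
```

The same helper covers one-to-one transformations as a special case, in which every value of Y inherits the probability of a single value of X.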

If X is a continuous random variable, then $Y = H(X)$ is not necessarily continuous
for discreteness can arise depending on the transformation H. Naturally, any continuous
and strictly monotone transformation preserves the continuity of the random variable.

Example: Let X denote a random variable in the real line and

$$Y = \begin{cases} -1, & \text{if } X < 0 \\ 1, & \text{if } X \ge 0. \end{cases}$$

In this instance, $\Pr(Y = y_i) = \int_A f_X(x)\,dx$, where A denotes the event about X that
corresponds to $\{Y = y_i\}$, that is to say, A is either the negative real line or the nonnegative
real line.

In general, to derive the probabilistic structure of $Y$ given $X$, we must first compute the
distribution function $F_Y(y) = \Pr(Y \le y)$ by means of the events in $X$ that correspond to
$\{Y \le y\}$ and then differentiate with respect to $y$ to obtain the density function $f_Y$. Finally,
we must also determine the support of $Y$ by seeking the values of $y$ for which $f_Y(y) > 0$.

Example: Let $X$ denote a continuous random variable with density function

$$f_X(x) = \begin{cases} 2x & \text{if } 0 < x < 1, \\ 0 & \text{otherwise.} \end{cases} \qquad (4.2)$$

Letting $Y = H(X) = 3X + 1$ then yields

$$F_Y(y) = \Pr(Y \le y) = \Pr(3X + 1 \le y) = \Pr\left(X \le \frac{y-1}{3}\right) = \int_0^{(y-1)/3} f_X(x)\,dx = \int_0^{(y-1)/3} 2x\,dx = x^2 \Big|_0^{(y-1)/3} = \frac{(y-1)^2}{9}.$$

The density function then is $f_Y(y) = F_Y'(y) = \frac{2}{9}(y - 1)$, whereas the support of $Y$ is given
by the interval $(1, 4)$, for $y = 3x + 1$ with $0 < x < 1$ ensures that both $f_X$ and $f_Y$ are
strictly positive.

As an alternative, if $H$ is differentiable and strictly monotone, we could also determine
$F_Y$ by noting that $x = H^{-1}(y)$ and hence

$$\{H(X) \le y\} = \begin{cases} \{X \le H^{-1}(y)\} & \text{if } H \text{ is strictly increasing} \\ \{X \ge H^{-1}(y)\} & \text{if } H \text{ is strictly decreasing.} \end{cases}$$

It then suffices to appreciate that

$$F_Y(y) = \Pr(Y \le y) = \Pr[H(X) \le y] = \begin{cases} F_X[H^{-1}(y)] & \text{if } H \text{ is strictly increasing} \\ 1 - F_X[H^{-1}(y)] & \text{if } H \text{ is strictly decreasing.} \end{cases}$$

As for the density function,

$$f_Y(y) = \frac{d}{dy} F_Y(y) = \left| \frac{dF_X}{dx} \frac{dx}{dy} \right| = f_X(x) \left| \frac{dx}{dy} \right|,$$

where $x = H^{-1}(y)$.

Examples 

(1) Let us revisit the previous example in which the density function of $X$ is given by (4.2)
and $Y = H(X) = 3X + 1$. The cumulative distribution function of $Y$ is $F_Y(y) = \Pr(Y \le y)
= F_X[(y - 1)/3]$, whereas the density function is $f_Y(y) = f_X(x) \left|\frac{dx}{dy}\right|$, with $x = (y - 1)/3$.
It then follows that $f_Y(y) = \frac{2}{9}(y - 1)$. Just as a sanity check, we next check whether the
latter integrates to one over the support of $Y$. If $x \in (0, 1)$, then $Y = 3X + 1 \in (1, 4)$ and so
$\int_1^4 \frac{2}{9}(y - 1)\,dy = \frac{1}{9}(y - 1)^2 \big|_1^4 = 1$.

(2) Letting now $Y = H(X) = e^{-X}$ yields

$$F_Y(y) = \Pr(Y \le y) = \Pr(e^{-X} \le y) = \Pr(X \ge -\ln y) = \int_{-\ln y}^{1} 2x\,dx = \left(x^2\right)\Big|_{-\ln y}^{1} = 1 - (-\ln y)^2.$$

Differentiating with respect to $y$ then leads to $f_Y(y) = F_Y'(y) = -(2 \ln y)/y$. As for the
support, we confirm that $Y \in (1/e, 1)$ by showing that $\int_{1/e}^{1} -\frac{2 \ln y}{y}\,dy = 1$. Needless to
say, applying the alternative method results in the same expressions for the cumulative
distribution and density functions. Indeed,

$$F_Y(y) = \Pr(Y \le y) = \Pr(X \ge -\ln y) = 1 - F_X(-\ln y),$$

whereas $f_Y(y) = f_X(x) \left|\frac{dx}{dy}\right| = -(2 \ln y)/y$ given that $x = -\ln y$.
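The change-of-variable formula in the two examples above can be checked numerically. What follows is a minimal sketch rather than a definitive implementation: the helper names `density_of_transform` and `midpoint_integral` are ours, and the integrals are approximated by a plain midpoint rule.

```python
import math


def f_x(x):
    """Density (4.2): 2x on (0, 1), zero elsewhere."""
    return 2.0 * x if 0.0 < x < 1.0 else 0.0


def density_of_transform(f, h_inv, dh_inv):
    """Density of Y = H(X) for strictly monotone H: f_X(H^{-1}(y)) |dH^{-1}/dy|."""
    return lambda y: f(h_inv(y)) * abs(dh_inv(y))


def midpoint_integral(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    step = (b - a) / n
    return step * sum(g(a + (i + 0.5) * step) for i in range(n))


# Y = 3X + 1: inverse x = (y - 1)/3 with derivative 1/3.
f_y_affine = density_of_transform(f_x, lambda y: (y - 1) / 3, lambda y: 1 / 3)
# Y = exp(-X): inverse x = -ln(y) with derivative -1/y.
f_y_exp = density_of_transform(f_x, lambda y: -math.log(y), lambda y: -1 / y)

total_affine = midpoint_integral(f_y_affine, 1.0, 4.0)
total_exp = midpoint_integral(f_y_exp, 1 / math.e, 1.0)
print(total_affine, total_exp)
```

Both totals should be close to one, in line with the sanity checks in the text.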

(3) Let $X$ denote a random variable with density function

$$f_X(x) = \begin{cases} 1/2 & \text{if } -1 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

The cumulative distribution function of $Y = X^2$ then is

$$F_Y(y) = \Pr(Y \le y) = \Pr(X^2 \le y) = \Pr(-\sqrt{y} \le X \le \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}),$$

whereas the density function is

$$f_Y(y) = F_Y'(y) = \frac{f_X(\sqrt{y})}{2\sqrt{y}} + \frac{f_X(-\sqrt{y})}{2\sqrt{y}} = \frac{1}{2\sqrt{y}} \left[ f_X(\sqrt{y}) + f_X(-\sqrt{y}) \right] = \frac{1}{2\sqrt{y}} \left( \frac{1}{2} + \frac{1}{2} \right) = \frac{1}{2\sqrt{y}},$$

with a support given by the unit interval.

4.2 Random vectors and joint distributions

A random vector $X = (X_1, \ldots, X_n)$ is a vector of, say $n$, random variables. There are
three types of random vectors: continuous, discrete, and mixed. The latter is essentially
a vector including both continuous and discrete random variables, and hence we will focus
only on continuous and discrete random vectors. In what follows, we will consider a
bivariate random vector $(X, Y)$, though extending the discussion to $n$-dimensional random
vectors is straightforward. The joint probability distribution function of a discrete bivariate
random vector $(X, Y)$ is $p(x, y) = \Pr(X = x, Y = y)$, where $X \in \{x_1, \ldots, x_n, \ldots\}$ and
$Y \in \{y_1, \ldots, y_n, \ldots\}$. Similarly, the joint density function of a continuous bivariate random
vector $(X, Y)$ is such that $f_{XY}(x, y)\,\Delta x\,\Delta y \approx \Pr(x < X < x + \Delta x,\, y < Y < y + \Delta y)$ for small
enough $\Delta x$ and $\Delta y$. As before, the density function is such that $f_{XY}(x, y) \ge 0$ for every
$(x, y) \in \mathbb{R}^2$ and that $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx\,dy = 1$.

Examples 

(1) Suppose there are two shoemakers in a shop. The first makes at most 5 shoes in a given
month, whereas the second takes more time to make a shoe and hence makes at most 3 shoes
per month. Let $X \in \{0, 1, \ldots, 5\}$ and $Y \in \{0, 1, \ldots, 3\}$ denote the number of shoes by the
first and second shoemakers in a given month, with probabilities given by

          X = 0     1     2     3     4     5
  Y = 0    0.00  0.01  0.03  0.05  0.07  0.09
      1    0.01  0.02  0.04  0.05  0.06  0.08
      2    0.01  0.03  0.05  0.05  0.05  0.06
      3    0.01  0.02  0.04  0.06  0.06  0.05

If the interest lies in the event $B = \{X > Y\}$, for instance, it then suffices to sum up the
probabilities of the cells for which $x > y$, resulting in $\Pr(B) = 3/4$.

(2) The shop owner is a bit concerned with the amount of leather each shoemaker employs
per month. Let $X$ and $Y$ now denote the quantity of leather the first and second shoemakers
spend, respectively. The shop owner is so miserly that he gauges with infinite precision how
much leather the shoemakers use, and so we may assume that $(X, Y)$ is a continuous bivariate
random variable. In addition, the joint density function is given by

$$f_{XY}(x, y) = \begin{cases} a & \text{if } 5 \le x \le 10 \text{ and } 4 \le y \le 9 \\ 0 & \text{otherwise.} \end{cases}$$

We first compute $a$ by integrating the density function over $\mathbb{R}^2$:

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx\,dy = 1 \quad \Longrightarrow \quad \int_4^9 \int_5^{10} a\,dx\,dy = \int_4^9 (ax)\Big|_5^{10}\,dy = \int_4^9 5a\,dy = (5ay)\Big|_4^9 = 25a = 1,$$

implying that $a = 1/25$. Next, let us compute how likely the event $B = \{X > Y\}$ is, namely,
that the first shoemaker employs more leather than the second shoemaker:

$$\Pr(B) = 1 - \Pr(X \le Y) = 1 - \int_5^9 \int_5^y \frac{1}{25}\,dx\,dy = 1 - \frac{1}{25} \int_5^9 (y - 5)\,dy = 1 - \frac{8}{25} = \frac{17}{25}.$$

The outer integral is over the interval $[5, 9]$ because it is impossible to observe $Y \ge X$ if
$4 \le Y < 5$ given that $X \ge 5$, whereas the inner integral is over the interval $[5, y]$ because
$X$ cannot exceed the value that we observe for $Y$ given that $Y \ge X$.
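(3) Let $(X, Y)$ denote a bivariate random variable with joint density function

$$f_{XY}(x, y) = \begin{cases} x^2 + xy/3 & \text{if } 0 \le x \le 1 \text{ and } 0 \le y \le 2 \\ 0 & \text{otherwise.} \end{cases}$$

We first show that the above density function integrates to one:

$$\int_0^1 \int_0^2 \left( x^2 + \frac{1}{3} xy \right) dy\,dx = \int_0^1 \left( x^2 y + \frac{1}{6} x y^2 \right) \Big|_0^2\,dx = \int_0^1 \left( 2x^2 + \frac{2}{3} x \right) dx = \left( \frac{2}{3} x^3 + \frac{1}{3} x^2 \right) \Big|_0^1 = \frac{2}{3} + \frac{1}{3} = 1.$$

We next illustrate how to compute the likelihood of an event that involves both $X$ and $Y$,
say, the probability of observing $\{X + Y \le 1\}$:

$$\begin{aligned}
\Pr(X + Y \le 1) &= \Pr(Y \le 1 - X) = \int_0^1 \int_0^{1-x} \left( x^2 + \frac{1}{3} xy \right) dy\,dx = \int_0^1 \left( x^2 y + \frac{1}{6} x y^2 \right) \Big|_0^{1-x}\,dx \\
&= \int_0^1 \left[ x^2 (1 - x) + \frac{1}{6} x (1 - x)^2 \right] dx = \int_0^1 \left[ x^2 - x^3 + \frac{1}{6} \left( x - 2x^2 + x^3 \right) \right] dx \\
&= \int_0^1 \left( \frac{2}{3} x^2 - \frac{5}{6} x^3 + \frac{1}{6} x \right) dx = \left( \frac{2}{9} x^3 - \frac{5}{24} x^4 + \frac{1}{12} x^2 \right) \Big|_0^1 \\
&= \frac{2}{9} - \frac{5}{24} + \frac{1}{12} = \frac{16 - 15 + 6}{72} = \frac{7}{72}.
\end{aligned}$$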
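Probabilities of events involving both $X$ and $Y$ amount to double integrals of the joint density, which are easy to approximate on a grid. The sketch below uses a plain midpoint rule for the density $x^2 + xy/3$ of the last example; the function name `prob` and the grid size are our own choices, so the results are approximations rather than exact values.

```python
def f_xy(x, y):
    """Joint density x^2 + x*y/3 on (0, 1) x (0, 2), zero elsewhere."""
    return x * x + x * y / 3.0 if 0.0 < x < 1.0 and 0.0 < y < 2.0 else 0.0


def prob(event, n=400):
    """Midpoint-rule approximation of Pr(event) over the support (0, 1) x (0, 2)."""
    hx, hy = 1.0 / n, 2.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * hx
        for j in range(n):
            y = (j + 0.5) * hy
            if event(x, y):
                total += f_xy(x, y) * hx * hy
    return total


total_mass = prob(lambda x, y: True)             # should be close to 1
p_sum_le_one = prob(lambda x, y: x + y <= 1.0)   # should be close to 7/72
print(total_mass, p_sum_le_one, 7 / 72)
```

The same function handles any other event by swapping the indicator passed to `prob`.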



In general, it follows that the joint probability distribution function of a continuous
random vector $X = (X_1, \ldots, X_n)$ is given by

$$F_X(x) = \Pr(X \le x) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_X(x_1, \ldots, x_n)\,dx_1 \cdots dx_n,$$

with $x = (x_1, \ldots, x_n)$, whereas the joint density function is

$$f_X(x) = \frac{d}{dx} F_X(x) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F_X(x_1, \ldots, x_n).$$

4.3 Marginal distributions

Knowing the joint distribution function $F_{XY}$ of $(X, Y)$ also implies the knowledge of the
marginal distributions $F_X$ and $F_Y$ of $X$ and $Y$, respectively. After all, it contains all
information about the probabilistic structure of both $X$ and $Y$. To extract the marginal
distributions of $X$ and $Y$, it suffices to 'integrate' the other random variable out of the joint
distribution. We employ quotation marks because integrating out a discrete random variable,
say $Y$, corresponds to summing the joint probability for all possible values of $Y$ given
that any of these values may occur. Letting $B_j = \{X = x, Y = y_j\}$ for $j = 1, 2, \ldots$ then
yields $p(x) = \Pr(X = x) = \Pr(B_1 \text{ or } \cdots \text{ or } B_n \text{ or } \cdots) = \sum_{j=1}^{\infty} p(x, y_j)$, given that these
events are all mutually exclusive.

As for continuous random variables, the marginal density function of $X$ is such that
$f_X(x)\,\Delta x \approx \Pr(x < X < x + \Delta x)$ for a very small $\Delta x > 0$, with

$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy,$$

and hence

$$\Pr(a \le X \le b) = \Pr(a \le X \le b,\, -\infty < Y < \infty) = \int_a^b \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy\,dx.$$

Examples 

(1) Let $(X, Y)$ denote a bivariate random vector with joint density given by

$$f_{XY}(x, y) = \begin{cases} 2(x + y - 2xy) & \text{if } 0 \le x, y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

We first confirm that the above indeed is a density function by showing that it integrates to one:

$$\int_0^1 \int_0^1 2(x + y - 2xy)\,dx\,dy = \int_0^1 \left( x^2 + 2xy - 2x^2 y \right) \Big|_0^1\,dy = \int_0^1 dy = 1.$$

It is evident from the above derivations that the marginal distributions of $X$ and $Y$ are both
uniform in the unit interval, even though their joint distribution is not uniform: integrating $x$
out leaves a density equal to one for every $y$ in the unit interval and, by symmetry, the same
holds for $X$.

(2) Let $(X, Y)$ denote a uniform random variable in the rectangle $[\alpha_X, \beta_X] \times [\alpha_Y, \beta_Y]$. The
joint density function then is

$$f_{XY}(x, y) = \begin{cases} \dfrac{1}{(\beta_X - \alpha_X)(\beta_Y - \alpha_Y)} & \text{if } \alpha_X \le x \le \beta_X,\ \alpha_Y \le y \le \beta_Y \\ 0 & \text{otherwise.} \end{cases}$$

Integrating $X$ out yields a uniform density function in the interval $[\alpha_Y, \beta_Y]$ for $Y$, whereas
integrating the latter out gives way to a uniform density function for $X$ in the interval
$[\alpha_X, \beta_X]$.

(3) Suppose now that $(X, Y)$ is uniform in $B = \{(x, y) : 0 \le x \le 1,\ x^2 \le y \le x\}$. It follows
from the fact that the area of $B$ is $\int_0^1 (x - x^2)\,dx = \frac{1}{6}$ that

$$f_{XY}(x, y) = \begin{cases} 6 & \text{if } (x, y) \in B \\ 0 & \text{if } (x, y) \notin B, \end{cases}$$

otherwise the joint density would integrate to something different from one. The marginal
density of $X$ then is $f_X(x) = \int_{x^2}^{x} 6\,dy = 6x(1 - x)$ for $0 \le x \le 1$, whereas the marginal of $Y$
is $f_Y(y) = \int_y^{\sqrt{y}} 6\,dx = 6(\sqrt{y} - y)$ for $0 \le y \le 1$.

The above examples illustrate that a uniform joint distribution does not ensure uniform 
marginals, just as uniform marginals do not imply a uniform joint distribution. 

4.4 Conditional density function 

In the case of discrete random variables, by definition, it suffices to compute the ratio between
the probability of observing both events and the probability of observing the conditioning
event, so that $p(x|y) = p(x, y)/p(y)$ if $p(y) > 0$ and $p(y|x) = p(x, y)/p(x)$ if $p(x) > 0$. Note
that the conditional probability meets all of the conditions for a probability distribution
function, namely, it is nonnegative for all values of $x$ and sums up to one given that

$$\sum_{i=1}^{\infty} p(x_i|y) = \sum_{i=1}^{\infty} \frac{p(x_i, y)}{p(y)} = \frac{1}{p(y)} \sum_{i=1}^{\infty} p(x_i, y) = \frac{p(y)}{p(y)} = 1.$$

Example: Let us revisit the example of the two shoemakers and compute the probability
of the first shoemaker to produce two shoes in a given month given that the second
shoemaker has also produced two shoes. By the definition of conditional probability, it
suffices to compute the ratio between the probability of observing both events and the
probability of observing the conditioning event, so that

$$\Pr(X = 2 \mid Y = 2) = \frac{\Pr(X = 2, Y = 2)}{\Pr(Y = 2)} = \frac{0.05}{0.25} = \frac{1}{5}.$$
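The whole conditional distribution of $X$ given $Y = 2$ follows from the same ratio applied to every cell of the corresponding row of the joint table. The sketch below is an illustration under one assumption: the row entries are taken as 0.01, 0.03, 0.05, 0.05, 0.05 and 0.06, the last one being the value implied by $\Pr(Y = 2) = 0.25$.

```python
from fractions import Fraction

# Joint probabilities Pr(X = x, Y = 2) for x = 0, ..., 5, in hundredths.
# The last entry (6) is the value implied by Pr(Y = 2) = 0.25 used in the text.
row_y2 = [Fraction(k, 100) for k in (1, 3, 5, 5, 5, 6)]

p_y2 = sum(row_y2)                      # marginal Pr(Y = 2)
cond_pmf = [p / p_y2 for p in row_y2]   # Pr(X = x | Y = 2) for x = 0, ..., 5

for x, p in enumerate(cond_pmf):
    print(f"Pr(X = {x} | Y = 2) = {p}")
```

By construction, the conditional probabilities add up to one, as any probability distribution function should.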



As for continuous random variables, it is easy to see that the same applies to density
functions given that they proxy for the probability of observing the random variable within an
interval of infinitesimal length. It thus ensues that $f_{X|Y}(x|y) = f_{XY}(x, y)/f_Y(y)$ if $f_Y(y) > 0$
and that $f_{Y|X}(y|x) = f_{XY}(x, y)/f_X(x)$ if $f_X(x) > 0$. As before, the conditional density
function indeed is a density function in that $f_{X|Y}(x|y) \ge 0$ for every $x \in \mathbb{R}$ and

$$\int_{-\infty}^{\infty} f_{X|Y}(x|y)\,dx = \int_{-\infty}^{\infty} \frac{f_{XY}(x, y)}{f_Y(y)}\,dx = \frac{1}{f_Y(y)} \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx = \frac{f_Y(y)}{f_Y(y)} = 1.$$

Example: Let $(X, Y)$ denote a random vector with joint density function given by

$$f_{XY}(x, y) = \begin{cases} x^2 + \frac{1}{3} xy & \text{if } (x, y) \in [0, 1] \times [0, 2] \\ 0 & \text{otherwise.} \end{cases}$$

The conditional density of $X$ given $Y$ then is

$$f_{X|Y}(x|y) = \frac{f_{XY}(x, y)}{f_Y(y)} = \frac{x^2 + \frac{1}{3} xy}{\int_0^1 \left( x^2 + \frac{1}{3} xy \right) dx} = \frac{x^2 + \frac{1}{3} xy}{\left( \frac{1}{3} x^3 + \frac{1}{6} x^2 y \right) \Big|_0^1} = \frac{x^2 + \frac{1}{3} xy}{\frac{1}{3} + \frac{y}{6}} = \frac{6x^2 + 2xy}{2 + y},$$

for $0 \le x \le 1$ and $0 \le y \le 2$.
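As a quick check, the conditional density just derived should integrate to one in $x$ for any admissible value of $y$. The sketch below verifies this by a midpoint rule at a few values of $y$ that we pick arbitrarily; the helper name `integrate_over_x` is ours.

```python
def cond_density(x, y):
    """f_{X|Y}(x|y) = (6x^2 + 2xy)/(2 + y) for 0 <= x <= 1 and 0 <= y <= 2."""
    return (6.0 * x * x + 2.0 * x * y) / (2.0 + y)


def integrate_over_x(y, n=10_000):
    """Midpoint-rule approximation of the integral of f_{X|Y}(x|y) over x in [0, 1]."""
    step = 1.0 / n
    return step * sum(cond_density((i + 0.5) * step, y) for i in range(n))


# Arbitrary conditioning values within the support of Y.
masses = {y: integrate_over_x(y) for y in (0.0, 0.5, 1.0, 2.0)}
print(masses)
```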

4.5 Independent random variables 

There is a nice correspondence between independent events and independent random
variables. The condition that $\Pr(A \cap B) = \Pr(A)\Pr(B)$ translates into $p(x, y) = p(x)p(y)$ if
$X$ and $Y$ are discrete random variables and into $f_{XY}(x, y) = f_X(x) f_Y(y)$ if $X$ and $Y$ are
continuous random variables. That is to say, independence ensues if and only if the joint
probability/density is the product of the marginals. In addition, we can also define
independence by means of conditional probabilities/densities in that independence holds if and only
if (a) $p(x|y) = p(x, y)/p(y) = p(x)$ and $p(y|x) = p(x, y)/p(x) = p(y)$ if $X$ and $Y$ are discrete;
and (b) $f_{X|Y}(x|y) = f_{XY}(x, y)/f_Y(y) = f_X(x)$ and $f_{Y|X}(y|x) = f_{XY}(x, y)/f_X(x) = f_Y(y)$ if
$X$ and $Y$ are continuous.

Examples 

(1) Let $X$ and $Y$ denote the time it takes to observe a transaction for two stocks, with joint
density function given by

$$f_{XY}(x, y) = \begin{cases} \exp[-(x + y)] & \text{if } x \ge 0,\ y \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$

It is easy to see that $f_{XY}(x, y) = f_X(x) f_Y(y)$, with $f_X(x) = e^{-x}$ for $x \ge 0$ and $f_Y(y) = e^{-y}$
for $y \ge 0$, and hence $X$ and $Y$ are independent random variables.

(2) Let now $X$ and $Y$ denote random variables with joint density given by $f_{XY}(x, y) = 8xy$
for $0 < x < y < 1$. Although it is easy to decompose $f_{XY}$ into the product of functions that
depend exclusively on either $x$ or $y$, there is no way to get rid of the dependence in the
support. The fact that $X$ is always smaller than $Y$ is what makes the two random variables
dependent.
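The role of the support in the second example becomes apparent once we compare the joint density with the product of the marginals at a point outside the triangle $0 < x < y < 1$. The sketch below computes the marginals by a midpoint rule; the evaluation point $(0.8, 0.3)$ and the helper names are arbitrary choices of ours.

```python
def f_xy(x, y):
    """Joint density 8xy on the triangle 0 < x < y < 1, zero elsewhere."""
    return 8.0 * x * y if 0.0 < x < y < 1.0 else 0.0


def integrate(g, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    step = (b - a) / n
    return step * sum(g(a + (i + 0.5) * step) for i in range(n))


def f_x(x):
    """Marginal density of X: integrate y out of the joint density."""
    return integrate(lambda y: f_xy(x, y), 0.0, 1.0)


def f_y(y):
    """Marginal density of Y: integrate x out of the joint density."""
    return integrate(lambda x: f_xy(x, y), 0.0, 1.0)


x0, y0 = 0.8, 0.3              # a point outside the support, since x0 > y0
joint = f_xy(x0, y0)           # zero by construction
product = f_x(x0) * f_y(y0)    # strictly positive
print(joint, product)
```

The joint density vanishes at this point whereas the product of the marginals does not, and so the factorization that independence requires fails.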

Finally, it is interesting to observe more closely how the link between independent events 
and independent random variables works. Let A and B denote independent events concerning 
the random variables X and Y, respectively. Equivalence follows from the fact that 

$$\Pr(A \cap B) = \iint_{A \cap B} f_{XY}(x, y)\,dx\,dy = \iint_{A \cap B} f_X(x) f_Y(y)\,dx\,dy = \int_A f_X(x)\,dx \int_B f_Y(y)\,dy = \Pr(A)\Pr(B).$$

4.6 Expected value, moments, and co-moments

Another way to characterize a distribution is through its moments. The first moment refers to 
the expected value of the distribution, whereas the second moment relates to the dispersion of 
the distribution. There is also room for higher-order moments in that there are distributions 
with an infinite number of moments. For instance, the third moment tells us whether the 
distribution is asymmetric around the expected value, whilst the fourth moment determines 
how thick the tails of the distribution are, that is to say, how likely it is to observe an extreme 
realization regardless of whether to the right or to the left of the expected value.

4.6.1 Expected value

The first moment of a distribution is known as expected value or population mean. As the 
name implies, it entails a typical value for the distribution. Before formalizing the notion of 
expected value, it is convenient to start with an example to motivate the discussion. 

Example: Gimli proposes a game to his friend Legolas with prizes in gold pieces as in the 
table below. In addition, if Legolas decides to play, he must pay beforehand one gold piece 
to Gimli. 

  die throw          1    2    3    4    5    6
  Legolas' payoff   -3   -2   -1    1    2    5

Legolas' expected payoff then is the average payoff minus the entry costs, that is to say,
$\mathrm{E}(\text{Legolas' payoff}) = \frac{1}{6}(5 + 2 + 1 - 1 - 2 - 3) - 1 = -\frac{2}{3}$. This means that the game is not
very fair to Legolas given that, on average, he will have to pay 2/3 of a gold piece to Gimli.
To make the game fair, Legolas would have to bargain down the entry cost to 1/3.
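The arithmetic behind Legolas' expected payoff is easy to reproduce with exact fractions. The sketch below is merely illustrative; the variable names are ours.

```python
from fractions import Fraction

payoffs = {1: -3, 2: -2, 3: -1, 4: 1, 5: 2, 6: 5}   # prize for each die face
entry_cost = 1                                       # gold pieces paid beforehand

# Each face of a fair die has probability 1/6.
expected_prize = sum(Fraction(1, 6) * prize for prize in payoffs.values())
expected_payoff = expected_prize - entry_cost
fair_entry_cost = expected_prize   # entry cost that makes the expected payoff zero

print(expected_prize, expected_payoff, fair_entry_cost)
```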

In the above example, it is easy to compute the expected value because the outcomes 
of a die throw are equiprobable and hence it suffices to compute the arithmetic mean of 
the possible outcomes. In general, the values that the random variable can assume are 
not equiprobable and hence we must weigh by their mass of probability. We do that by 
applying the expectation operator $\mathrm{E}(\cdot)$ to the random variable of interest. In the case of
discrete random variables, the expectation operator is such that $\mathrm{E}(X) = \sum_i x_i\,p(x_i)$. If
the latter series does not converge to some finite value, then we say that the distribution has
no expected value.

Examples 

(1) Let $X \sim B(n, p)$, so that $\Pr(X = x) = \binom{n}{x} p^x (1 - p)^{n-x}$. The expected value of a
binomial distribution is

$$\mathrm{E}(X) = \sum_{x=0}^{n} x \Pr(X = x) = \sum_{x=0}^{n} x \binom{n}{x} p^x (1 - p)^{n-x} = \sum_{x=1}^{n} x\,\frac{n!}{x!\,(n - x)!}\,p^x (1 - p)^{n-x} = \sum_{x=1}^{n} \frac{n!}{(x - 1)!\,(n - x)!}\,p^x (1 - p)^{n-x}.$$

Letting $k = x - 1$ then yields

$$\mathrm{E}(X) = n \sum_{k=0}^{n-1} \binom{n-1}{k} p^{k+1} (1 - p)^{n-1-k} = np \sum_{k=0}^{n-1} \binom{n-1}{k} p^k (1 - p)^{n-1-k} = np,$$

given that $\binom{n-1}{k} p^k (1 - p)^{n-1-k}$ is the probability distribution function of a $B(n - 1, p)$. Note
that the expected value of the binomial distribution corresponds to the expected absolute
frequency with which we observe a value equal to one in the $n$ trials.

(2) Suppose that a given trading strategy normally entails a weekly return $S$, though it
may profit less if there is a sequence of bad news. More precisely, the return reduces to
$0 < R < S$ if there are one or two pieces of bad news, and to $L < 0$ if there are three or more
pieces of bad news over the week. The probability of observing a piece of bad news is 1/20 in
any given trading day of the week. Let now $X \in \{L, R, S\}$ and $B \in \{0, 1, 2, 3, 4, 5\}$ denote the
weekly return of this trading strategy and the number of pieces of bad news over the week,
respectively. The latter is a binomial random variable with probability distribution function given by
$\Pr(B = k) = \binom{5}{k} (1/20)^k (19/20)^{5-k}$, and hence the expected value of $X$ is

$$\begin{aligned}
\mathrm{E}(X) &= \Pr(B = 0)\,S + [\Pr(B = 1) + \Pr(B = 2)]\,R + \Pr(B \ge 3)\,L \\
&= (19/20)^5\,S + \left[ 5 (1/20)(19/20)^4 + 10 (1/20)^2 (19/20)^3 \right] R \\
&\quad + \left[ 10 (1/20)^3 (19/20)^2 + 5 (1/20)^4 (19/20) + (1/20)^5 \right] L \\
&= (19/20)^5\,S + 5 (1/20)(19/20)^3 \left[ 19/20 + 2 (1/20) \right] R \\
&\quad + (1/20)^3 \left[ 10 (19/20)^2 + 5 (1/20)(19/20) + (1/20)^2 \right] L \\
&= 0.7737809\,S + 0.2250609\,R + 0.0011581\,L.
\end{aligned}$$

Note that the weights assigned to each outcome of the weekly return sum up to one for they
correspond to their probability of occurring.
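The three weights in the expected return come straight from the binomial probabilities, and so they are simple to reproduce. The sketch below relies on `math.comb` from the standard library; the function `expected_return` and its arguments are our own naming, with $S$, $R$ and $L$ left as inputs because the text does not fix their values.

```python
from math import comb


def binom_pmf(k, n, p):
    """Pr(B = k) for a binomial random variable B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_days, p_bad = 5, 1 / 20

w_s = binom_pmf(0, n_days, p_bad)                                     # no bad news
w_r = binom_pmf(1, n_days, p_bad) + binom_pmf(2, n_days, p_bad)       # one or two
w_l = sum(binom_pmf(k, n_days, p_bad) for k in range(3, n_days + 1))  # three or more


def expected_return(s, r, l):
    """E(X) = Pr(B = 0) S + Pr(1 <= B <= 2) R + Pr(B >= 3) L."""
    return w_s * s + w_r * r + w_l * l


print(w_s, w_r, w_l, w_s + w_r + w_l)
```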



As for continuous random variables, the expectation operator is such that $\mathrm{E}(X) =
\int_{-\infty}^{\infty} x f_X(x)\,dx$. As before, if the function $g(x) = x f_X(x)$ is not integrable, then the
distribution features no expected value.

Examples

(1) Let $X$ denote a uniform random variable in the interval $[\alpha, \beta]$. In view that the density
function is

$$f_X(x) = \begin{cases} \dfrac{1}{\beta - \alpha} & \text{if } \alpha \le x \le \beta \\ 0 & \text{otherwise,} \end{cases}$$

it follows that

$$\mathrm{E}(X) = \int_{\alpha}^{\beta} \frac{x}{\beta - \alpha}\,dx = \frac{x^2/2}{\beta - \alpha} \Big|_{\alpha}^{\beta} = \frac{\beta^2 - \alpha^2}{2(\beta - \alpha)} = \frac{(\beta - \alpha)(\beta + \alpha)}{2(\beta - \alpha)} = \frac{\beta + \alpha}{2}.$$

This makes sense for the uniform continuous distribution is analogous to the case of equiprobable
events and hence it suffices to take the arithmetic mean of the lower and upper limits
of the support.
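(2) Let $X$ denote a random variable with density function given by

$$f_X(x) = \begin{cases} x/225 & \text{if } 0 < x \le 15 \\ (30 - x)/225 & \text{if } 15 < x < 30 \\ 0 & \text{otherwise.} \end{cases}$$

The expected value then is

$$\begin{aligned}
\mathrm{E}(X) &= \int_0^{15} \frac{x^2}{225}\,dx + \int_{15}^{30} \frac{(30 - x)\,x}{225}\,dx = \frac{1}{225} \left[ \frac{x^3}{3} \Big|_0^{15} + \left( 15 x^2 - \frac{x^3}{3} \right) \Big|_{15}^{30} \right] \\
&= \frac{1}{225} \left[ \frac{15^3}{3} + 15 \times 30^2 - 15^3 - \frac{30^3}{3} + \frac{15^3}{3} \right] = 5 + 60 - 15 - 40 + 5 = 15.
\end{aligned}$$

This result is pretty intuitive given that the density function looks like a symmetric triangle
with peak at 15.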

Suppose now we wish to derive the expected value of $Y = H(X)$. The expectation
operator is such that

$$\mathrm{E}(Y) = \begin{cases} \sum_{i=1}^{\infty} y_i\,p(y_i) = \sum_{i=1}^{\infty} H(x_i)\,p(x_i) & \text{if discrete} \\ \int_{-\infty}^{\infty} y f_Y(y)\,dy = \int_{-\infty}^{\infty} H(x) f_X(x)\,dx & \text{if continuous.} \end{cases}$$

This means that there is no need to derive the distribution of $Y$ to compute its expected
value. It suffices to compute the expectation of $H(X)$ given the density function of $X$.

Example: In some situations, the interest lies in the magnitude of the random variable
regardless of the sign it takes. Suppose, for instance, that $X$ has a double exponential density
given by

$$f_X(x) = \begin{cases} \frac{1}{2} e^{x} & \text{if } x < 0 \\ \frac{1}{2} e^{-x} & \text{if } x \ge 0, \end{cases}$$

which is symmetric around zero. Now, the expected value of $Y = |X|$ is

$$\begin{aligned}
\mathrm{E}(Y) &= \int_{-\infty}^{\infty} |x| f_X(x)\,dx = \frac{1}{2} \left[ \int_{-\infty}^{0} |x|\,e^{x}\,dx + \int_0^{\infty} |x|\,e^{-x}\,dx \right] \\
&= \frac{1}{2} \left[ \int_{-\infty}^{0} (-x)\,e^{x}\,dx + \int_0^{\infty} x e^{-x}\,dx \right] = \int_0^{\infty} x e^{-x}\,dx = 1,
\end{aligned}$$

where the last equality follows from integration by parts. Alternatively, one can compute
the expected value of $Y$ by first deriving its distribution. In particular,

$$F_Y(y) = \Pr(Y \le y) = \Pr(|X| \le y) = \Pr(-y \le X \le y) = 2 \Pr(0 \le X \le y) = 2 \int_0^y f_X(x)\,dx = 2 \int_0^y \frac{1}{2} e^{-x}\,dx = \left( -e^{-x} \right) \Big|_0^y = 1 - e^{-y},$$

giving way to $f_Y(y) = e^{-y}$ for $y > 0$ and to $\mathrm{E}(Y) = \int_0^{\infty} y f_Y(y)\,dy = \int_0^{\infty} y e^{-y}\,dy = 1$.
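Both routes to $\mathrm{E}(|X|)$ are simple to compare numerically. The sketch below approximates the two integrals by a midpoint rule; truncating the real line at $\pm 40$ is an assumption of ours that discards a tail mass of the order of $e^{-40}$.

```python
import math


def f_x(x):
    """Double exponential density: exp(-|x|)/2."""
    return 0.5 * math.exp(-abs(x))


def f_y(y):
    """Density of Y = |X| derived in the text: exp(-y) for y > 0."""
    return math.exp(-y) if y > 0.0 else 0.0


def integrate(g, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    step = (b - a) / n
    return step * sum(g(a + (i + 0.5) * step) for i in range(n))


# Route 1: integrate H(x) f_X(x) with H(x) = |x|, without deriving f_Y.
via_x = integrate(lambda x: abs(x) * f_x(x), -40.0, 40.0)
# Route 2: integrate y f_Y(y) using the distribution of Y.
via_y = integrate(lambda y: y * f_y(y), 0.0, 40.0)
print(via_x, via_y)
```

Both approximations should be close to one, which is the value obtained analytically.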

It is possible to compute the expected value of a function of a random vector, say $Z =
H(X, Y)$, along the same lines. More precisely,

$$\mathrm{E}(Z) = \begin{cases} \sum_{i=1}^{\infty} z_i\,p(z_i) = \sum_{i=1}^{\infty} H(x_i, y_i)\,p(x_i, y_i) & \text{if discrete} \\ \int_{-\infty}^{\infty} z f_Z(z)\,dz = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} H(x, y) f_{XY}(x, y)\,dx\,dy & \text{if continuous,} \end{cases}$$

which avoids the derivation of the probability/density function of $Z$. Apart from that, the
expectation operator has two other interesting properties. First, it is a linear operator in
that $\mathrm{E}(aX + b) = a \sum_{i=1}^{n} \mathrm{E}(X_i) + b$ for any fixed constants $a$ and $b$, if $X = \sum_{i=1}^{n} X_i$. Second,
if the random variables are independent, then the expectation of their product is equal to
the product of their expectations. This means that, if $X$ and $Y$ are independent, then
$\mathrm{E}(XY) = \mathrm{E}(X)\mathrm{E}(Y)$. The examples below employ these properties to derive expectations.

Examples 

(1) A binomial distribution $B(n, p)$ results from the sum of the outcomes of a sequence
of $n$ independent Bernoulli trials with probability $p$. Let $Y_i$ denote the outcome of each of
these Bernoulli trials ($i = 1, \ldots, n$), taking value one with probability $p$, otherwise zero.
The expected value of $X$ then is $\mathrm{E}(X) = \mathrm{E}(Y_1 + Y_2 + \cdots + Y_n) = \sum_{i=1}^{n} \mathrm{E}(Y_i) = np$.
(2) Let $D$ denote the weekly demand for apple crumbles, with probability distribution
function $p_n = \Pr(D = n)$. Let $C$ denote the cost of baking a unit of apple crumble and $E$
the cost of keeping one unsold apple crumble in stock. Suppose that we sell each apple crumble
at a price $P$ and that our initial stock is of $N$ apple crumbles. It then follows that our profit
$\Pi$ in a given week is a random variable given by

$$\Pi = \begin{cases} N(P - C) & \text{if } D > N \\ DP - NC - (N - D)E & \text{if } D \le N \end{cases}$$

because in the latter case we produce $N$ apple crumbles, sell $D$ and then stock the ones that
we are not able to sell during the week. This means that

$$\Pi = \begin{cases} N(P - C) & \text{with probability } \Pr(D > N) = 1 - \Pr(D \le N) \\ D(P + E) - N(C + E) & \text{with probability } \Pr(D \le N) \end{cases}$$

and hence

$$\begin{aligned}
\mathrm{E}(\Pi) &= N(P - C) \Pr(D > N) + \mathrm{E}\left[ D(P + E) - N(C + E) \mid D \le N \right] \Pr(D \le N) \\
&= N(P - C) \sum_{n=N+1}^{\infty} p_n + (P + E) \sum_{n=0}^{N} n\,p_n - N(C + E) \sum_{n=0}^{N} p_n \\
&= N(P - C) \left( 1 - \sum_{n=0}^{N} p_n \right) + (P + E) \sum_{n=0}^{N} n\,p_n - N(C + E) \sum_{n=0}^{N} p_n \\
&= N(P - C) + (P + E) \sum_{n=0}^{N} n\,p_n - \left[ N(C + E) + N(P - C) \right] \sum_{n=0}^{N} p_n \\
&= N(P - C) + (P + E) \sum_{n=0}^{N} n\,p_n - N(P + E) \sum_{n=0}^{N} p_n \\
&= N(P - C) - (P + E) \sum_{n=0}^{N} (N - n)\,p_n.
\end{aligned}$$

If, for instance, $\Pr(D = n) = \frac{1}{10}$ for $n \in \{0, 1, \ldots, 9\}$, it then ensues that

$$\begin{aligned}
\mathrm{E}(\Pi) &= N(P - C) - (P + E) \sum_{n=0}^{N} \frac{N - n}{10} = N(P - C) - \frac{P + E}{10} \left[ N(N + 1) - \sum_{n=0}^{N} n \right] \\
&= N(P - C) - \frac{P + E}{10} \left[ N(N + 1) - \frac{N(N + 1)}{2} \right] = N(P - C) - \frac{P + E}{10}\,\frac{N(N + 1)}{2} \\
&= N(P - C) - \frac{N(N + 1)(P + E)}{20}.
\end{aligned}$$

The above naturally only holds if $N \le 9$, which makes sense given that it seems unreasonable
to start the week with more than the maximum demand for apple crumbles.
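The closed-form expression for the expected profit under uniform demand can be checked against a direct enumeration of the profit over all demand levels. The sketch below does so with exact fractions; the price and cost figures are hypothetical values that we pick only for illustration, and the function names are ours.

```python
from fractions import Fraction


def expected_profit(n_stock, price, cost, holding, pmf):
    """E(profit) by direct enumeration over the demand levels in pmf."""
    total = Fraction(0)
    for demand, prob in pmf.items():
        if demand > n_stock:
            profit = n_stock * (price - cost)
        else:
            profit = demand * price - n_stock * cost - (n_stock - demand) * holding
        total += prob * profit
    return total


def closed_form(n_stock, price, cost, holding):
    """N(P - C) - N(N + 1)(P + E)/20, for uniform demand on {0, ..., 9} and N <= 9."""
    return n_stock * (price - cost) - Fraction(n_stock * (n_stock + 1) * (price + holding), 20)


uniform_demand = {n: Fraction(1, 10) for n in range(10)}

# Hypothetical price, baking cost and holding cost, chosen only for illustration.
P, C, E = 5, 2, 1
profits = {n: expected_profit(n, P, C, E, uniform_demand) for n in range(10)}
best_stock = max(profits, key=profits.get)
print(profits)
print(best_stock, profits[best_stock])
```

Enumerating the expected profit for every initial stock also shows how one could pick the value of $N$ that maximizes it under these illustrative figures.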

The expectation operator is a linear operator and hence it is extremely easy to deal
with affine functions of a random variable. Although we have already shown that it is
also straightforward to handle non-affine functions of random variables, it is important to
note that $\mathrm{E}[g(X)] \ne g[\mathrm{E}(X)]$ in general. For instance, the next example illustrates that
$\mathrm{E}(X^2) \ge [\mathrm{E}(X)]^2$ for any random variable $X$, with strict inequality unless $X$ is constant.

Example: Let $X$ denote a random variable that takes value either 1 or -1 with probability
1/2. Given that it is symmetric around zero, the expected value of $X$ is obviously zero and
hence $[\mathrm{E}(X)]^2 = 0$. In contrast, the expected value of its square exceeds zero given that
$\mathrm{E}(X^2) = (1 + 1) \times 1/2 = 1$.

4.6.2 Variance, covariance, and correlation 

The variance measures the amount of variation and dispersion of a random variable around
the mean value. This information is paramount to any problem of statistical inference. In
finance, we are not only interested in the expected return of an investment, but also in how
risky it is. Most people prefer an investment that entails a return of 10% for sure to
one with a return of either 30% or -10% with probability 1/2 despite the fact that both
investments have an expected return of 10%. The most basic measure of risk is given by the
variance, which gauges the average magnitude of the deviations with respect to the mean
value by means of a quadratic transformation: $\mathrm{var}(X) = \mathrm{E}[X - \mathrm{E}(X)]^2$.

We could of course employ the absolute value rather than the square to measure the 
magnitude of the deviation. The advantage of using squares is that it is differentiable as 
opposed to the absolute value and that it is very easy to compute by means of the expectation 
operator. The drawback is that taking squares amplifies the extreme values, if any, in the
data. Another disadvantage of the variance is that it is not in the same unit as the random
variable. This is however easy to remedy in that we can always consider the standard
deviation of the random variable, namely, the square root of the variance, so as to recover 
the original unit of measurement. 

The variance is also known as the second centered moment of the distribution. We say it 
is centered because we are looking at deviations around the first moment (i.e., the expected 
value). It relates to the second moment because of the focus on the second power of the 
random variable. It is interesting to note that we can also write the variance as a function
of the first two (uncentered) moments of the distribution given that

$$\mathrm{var}(X) = \mathrm{E}[X - \mathrm{E}(X)]^2 = \mathrm{E}\left( X^2 - 2X\,\mathrm{E}(X) + [\mathrm{E}(X)]^2 \right) = \mathrm{E}(X^2) - 2\,\mathrm{E}(X)\mathrm{E}(X) + [\mathrm{E}(X)]^2 = \mathrm{E}(X^2) - [\mathrm{E}(X)]^2.$$

The properties of the expectation operator are very helpful to compute the second uncentered
moment of the distribution in view that it is not necessary to find the distribution of the
square of the random variable. Indeed,

$$\mathrm{E}(X^2) = \begin{cases} \sum_{i=1}^{\infty} x_i^2\,p(x_i) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} x^2 f_X(x)\,dx & \text{if } X \text{ is continuous.} \end{cases}$$

The properties of the expectation operator also imply a couple of properties for the
variance. First, the variance of an affine function of $X$ is proportional to the variance of
$X$. In particular, $\mathrm{var}(aX + b) = a^2\,\mathrm{var}(X)$ for any fixed constants $a$ and $b$ given that
both the second moment and the square of the first moment will depend on the square of
the slope coefficient $a$ and the fact that we take deviations with respect to the expected
value will take care of the intercept $b$. Second, in the event that $X$ and $Y$ are independent
random variables, the variance of the sum is equal to the sum of the variances, that is to
say, $\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y)$. In what follows, we demonstrate both properties by
showing that $\mathrm{var}(aX + b + cY) = a^2\,\mathrm{var}(X) + c^2\,\mathrm{var}(Y)$ if $X$ and $Y$ are independent:

$$\begin{aligned}
\mathrm{var}(aX + b + cY) &= \mathrm{E}\left[ aX + b + cY - \mathrm{E}(aX + b + cY) \right]^2 \\
&= \mathrm{E}\left[ aX + b + cY - a\,\mathrm{E}(X) - b - c\,\mathrm{E}(Y) \right]^2 \\
&= \mathrm{E}\left\{ a[X - \mathrm{E}(X)] + c[Y - \mathrm{E}(Y)] \right\}^2 \\
&= \mathrm{E}\left\{ a^2 [X - \mathrm{E}(X)]^2 + 2ac\,[X - \mathrm{E}(X)][Y - \mathrm{E}(Y)] + c^2 [Y - \mathrm{E}(Y)]^2 \right\} \\
&= a^2\,\mathrm{var}(X) + c^2\,\mathrm{var}(Y) + 2ac\,\mathrm{E}[X - \mathrm{E}(X)][Y - \mathrm{E}(Y)].
\end{aligned}$$

To show that the last term is equal to zero, it suffices to appreciate that

$$\mathrm{E}[X - \mathrm{E}(X)][Y - \mathrm{E}(Y)] = \mathrm{E}(XY) - 2\,\mathrm{E}(X)\mathrm{E}(Y) + \mathrm{E}(X)\mathrm{E}(Y) = 0 \qquad (4.3)$$

given that independence between $X$ and $Y$ implies that the expectation of the product is
the product of the expectations.

Equation (4.3) suggests a simple measure of dependence between two random variables 
based on how their deviations relative to the mean co-vary. Bearing that in mind, we define 
the covariance between X and Y as 

$$\mathrm{cov}(X, Y) = \mathrm{E}[X - \mathrm{E}(X)][Y - \mathrm{E}(Y)] = \mathrm{E}(XY) - \mathrm{E}(X)\mathrm{E}(Y).$$

The intuition is simple. If the deviations $X - \mathrm{E}(X)$ tend to have the same sign as the
deviations $Y - \mathrm{E}(Y)$, we then say that $X$ and $Y$ co-move and hence their covariance
is positive. In contrast, the covariance is negative if the deviations tend to have opposite
signs. It is possible to show that the covariance is a measure of linear dependence and
hence independence implies zero covariance (or, equivalently, orthogonality), though zero
covariance does not necessarily imply independence.

Another drawback of the covariance is that it has a strange unit. It is in units of $X$ times
units of $Y$, compromising a bit interpretability. The easiest remedy for that is to standardize
the deviations of $X$ and $Y$ with respect to their means by their standard deviation, giving
way to the correlation:

$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}.$$

The latter has no unit in contrast to the covariance. In addition, standardizing by the
standard deviation is also convenient for it makes the two deviations comparable. The
correlation also has some very nice properties. First, it is at most one in magnitude, that is,
$-1 \le \mathrm{corr}(X, Y) \le 1$. Second, as the covariance, the correlation is equal to zero if and only
if $X$ and $Y$ are orthogonal (or, equivalently, linearly independent). Third, as in the variance
and the covariance, the correlation is also based on expectations and hence we can employ
all of the apparatus that comes with the expectation operator.
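To see why zero covariance (and hence zero correlation) does not imply independence, it helps to look at a small case of nonlinear dependence. The sketch below uses a hypothetical example of our own, namely $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$, with exact fractions.

```python
from fractions import Fraction

# Hypothetical example: X uniform on {-1, 0, 1} and Y = X^2.
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in support)            # E(X)
e_y = sum(p * x**2 for x in support)         # E(Y) = E(X^2)
e_xy = sum(p * x * x**2 for x in support)    # E(XY) = E(X^3)
cov_xy = e_xy - e_x * e_y

# Dependence: Pr(X = 0, Y = 0) differs from Pr(X = 0) Pr(Y = 0).
p_joint_00 = p       # Y = 0 exactly when X = 0
p_x0, p_y0 = p, p
independent = p_joint_00 == p_x0 * p_y0

print(cov_xy, independent)
```

The covariance is exactly zero even though $Y$ is a deterministic function of $X$, which illustrates that the covariance captures only linear dependence.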

Examples 

(1) A survey classifies the degree of customers' satisfaction, say $X$, into a scale from 0
to 10. The answers to the survey indicate a symmetric probability distribution given by
$p_0 = p_{10} = 0.05$, $p_1 = p_2 = p_8 = p_9 = 0.15$, $p_3 = \cdots = p_7 = 0.06$. The expected degree of
satisfaction then is $\mathrm{E}(X) = (1 + 2 + 8 + 9) \times 0.15 + (3 + 4 + 5 + 6 + 7) \times 0.06 + 10 \times 0.05 = 5$,
which makes sense given that the distribution is symmetric around 5. We next compute the
second moment of the customers' satisfaction, namely,

$$\mathrm{E}(X^2) = (1 + 4 + 64 + 81) \times 0.15 + (9 + 16 + 25 + 36 + 49) \times 0.06 + 100 \times 0.05 = 35.6,$$

implying a variance of $\mathrm{var}(X) = \mathrm{E}(X^2) - [\mathrm{E}(X)]^2 = 35.6 - 25 = 10.6$ and a standard
deviation of about 3.26.
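The moments of the satisfaction survey follow mechanically from the probability distribution, as the short sketch below illustrates. The variable names are ours; the probabilities are the ones given in the example.

```python
import math

# Probability distribution of the satisfaction score X on the scale 0, ..., 10.
pmf = {0: 0.05, 10: 0.05}
pmf.update({k: 0.15 for k in (1, 2, 8, 9)})
pmf.update({k: 0.06 for k in range(3, 8)})

mean = sum(x * p for x, p in pmf.items())               # E(X)
second_moment = sum(x * x * p for x, p in pmf.items())  # E(X^2)
variance = second_moment - mean**2                      # E(X^2) - [E(X)]^2
std_dev = math.sqrt(variance)

print(mean, second_moment, variance, std_dev)
```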

(2) Let $X$ denote a binomial random variable $B(n, p)$ with an expected value of $\mathrm{E}(X) = np$.
Instead of computing the second moment directly from $\mathrm{E}(X^2) = \sum_{x=0}^{n} x^2 \binom{n}{x} p^x (1 - p)^{n-x}$,
we will take advantage of the definition of the binomial distribution as a sequence of $n$
independent Bernoulli trials with probability $p$. Let $Y_i$, with $i \in \{1, \ldots, n\}$, denote these
Bernoulli trials. It then follows that $\mathrm{var}(X) = \mathrm{var}\left( \sum_{i=1}^{n} Y_i \right) = \sum_{i=1}^{n} \mathrm{var}(Y_i)$ given that all
the covariances are zero due to the independence between the Bernoulli trials. Now, the
variance of $Y_i$ is $\mathrm{var}(Y_i) = \mathrm{E}(Y_i^2) - [\mathrm{E}(Y_i)]^2 = p - p^2 = p(1 - p)$ and hence $\mathrm{var}(X) = np(1 - p)$.
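(3) Let $X$ denote a uniform random variable in the interval $[\alpha, \beta]$: i.e., $X \sim \mathcal{U}(\alpha, \beta)$. We
know that the expected value of $X$ is $\mathrm{E}(X) = (\alpha + \beta)/2$, whereas the second moment is

$$\mathrm{E}(X^2) = \int_{\alpha}^{\beta} \frac{x^2}{\beta - \alpha}\,dx = \frac{x^3}{3(\beta - \alpha)} \Big|_{\alpha}^{\beta} = \frac{\beta^3 - \alpha^3}{3(\beta - \alpha)}.$$

The variance of $X$ then reads

$$\begin{aligned}
\mathrm{var}(X) &= \frac{\beta^3 - \alpha^3}{3(\beta - \alpha)} - \frac{(\alpha + \beta)^2}{4} = \frac{4(\beta^3 - \alpha^3) - 3(\beta - \alpha)(\alpha + \beta)^2}{12(\beta - \alpha)} \\
&= \frac{4\beta^3 - 4\alpha^3 - 3\beta^3 + 3\alpha^3 - 6\alpha\beta^2 + 6\alpha^2\beta - 3\alpha^2\beta + 3\alpha\beta^2}{12(\beta - \alpha)} \\
&= \frac{\beta^3 - \alpha^3 - 3\alpha\beta^2 + 3\alpha^2\beta}{12(\beta - \alpha)} = \frac{(\beta - \alpha)^3}{12(\beta - \alpha)} = \frac{(\beta - \alpha)^2}{12},
\end{aligned}$$

which makes sense in that, as a measure of dispersion, it depends on the squared (Euclidean)
distance between the support bounds.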

(4) Let $(X, Y)$ denote a bivariate random variable with density function

$$f_{XY}(x, y) = \begin{cases} 2 & \text{if } 0 < x < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

The marginal densities then are $f_X(x) = \int_x^1 2\, dy = 2(1 - x)$ for $0 < x < 1$ and
$f_Y(y) = \int_0^y 2\, dx = 2y$ for $0 < y < 1$, with expected values of
$E(X) = \int_0^1 2x(1 - x)\, dx = \frac{1}{3}$ and $E(Y) = \int_0^1 2y^2\, dy = \frac{2}{3}$, respectively.
As for the moments of second order, we start with the covariance, namely,

$$\mathrm{cov}(X, Y) = E(XY) - E(X)\,E(Y) = \int_0^1 \int_0^y 2xy\, dx\, dy - \frac{2}{9} = \frac{1}{4} - \frac{2}{9} = \frac{1}{36}.$$

As for the variances, it follows that

$$\mathrm{var}(X) = E(X^2) - [E(X)]^2 = \int_0^1 2x^2(1 - x)\, dx - \frac{1}{9} = \frac{1}{6} - \frac{1}{9} = \frac{1}{18},$$

$$\mathrm{var}(Y) = E(Y^2) - [E(Y)]^2 = \int_0^1 2y^3\, dy - \frac{4}{9} = \frac{1}{2} - \frac{4}{9} = \frac{1}{18},$$

and hence the correlation between $X$ and $Y$ amounts to 1/2.

The covariance is such that $\mathrm{cov}(aX + b, cY + d) = ac\,\mathrm{cov}(X, Y)$. The intercepts $b$ and $d$
have no impact on the covariance because it deals with deviations with respect to the mean
value and the latter obviously shifts with $b$ and $d$ as much as $X$ and $Y$, respectively. Further,
the correlation is invariant to any affine transformation whose slopes share the same sign in
that $\mathrm{corr}(aX + b, cY + d) = \mathrm{corr}(X, Y)$ whenever $ac > 0$ (the correlation merely changes sign
if $ac < 0$). The extra level of robustness stems from the fact that we standardize the
deviations by the standard deviations, which change with $|a|$ and $|c|$ in the same proportion as
$X$ and $Y$, respectively.
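A small simulation can illustrate both the moments of example (4) and the affine rules just stated. The sketch below rests on an assumption that is not in the text: it generates $(X, Y)$ as the smaller and the larger of two independent uniform draws, a standard way of obtaining a pair with density 2 on $0 < x < y < 1$. The constants `a`, `b`, `c`, `d`, the seed and the sample size are arbitrary illustrative choices.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility

# (X, Y) = (smaller, larger) of two independent uniforms on (0, 1);
# this pair has joint density 2 on 0 < x < y < 1, as in example (4).
pairs = [sorted((random.random(), random.random())) for _ in range(100_000)]
x = [pair[0] for pair in pairs]
y = [pair[1] for pair in pairs]

def cov(u, v):
    """Sample covariance (dividing by the number of observations)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((s - mean_u) * (t - mean_v) for s, t in zip(u, v)) / len(u)

def corr(u, v):
    """Sample correlation coefficient."""
    return cov(u, v) / (cov(u, u) * cov(v, v)) ** 0.5

print(cov(x, y), corr(x, y))            # close to 1/36 and 1/2

# Affine transformations with illustrative constants (a * c > 0).
a, b, c, d = 2.0, 5.0, 3.0, -1.0
x2 = [a * xi + b for xi in x]
y2 = [c * yi + d for yi in y]
print(cov(x2, y2), a * c * cov(x, y))   # equal up to rounding error
print(corr(x2, y2), corr(x, y))         # equal up to rounding error
```

Choosing `a` and `c` with opposite signs would leave the magnitude of the correlation unchanged but flip its sign.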

4.6.3 Higher-order moments 

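In general, we define the $k$th uncentered moment of a distribution as

$$\mu_k = E(X^k) = \begin{cases} \sum_{i=1}^{\infty} x_i^k\, p(x_i) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} x^k f_X(x)\, dx & \text{if } X \text{ is continuous.} \end{cases}$$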

Similarly, we define the centered moments as $\bar{\mu}_k = E\{[X - E(X)]^k\}$. However, in most
situations, we prefer to standardize the random variable not only by subtracting the mean,
but also by dividing by the standard deviation, so as to obtain a quantity that is comparable
across different random variables. For instance, we define skewness and kurtosis as the
standardized third and fourth moments, respectively:

$$\mathrm{sk}(X) = E\left[\left(\frac{X - E(X)}{\sqrt{\mathrm{var}(X)}}\right)^{3}\right] \quad \text{and} \quad k(X) = E\left[\left(\frac{X - E(X)}{\sqrt{\mathrm{var}(X)}}\right)^{4}\right].$$
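The two definitions are easy to evaluate for a discrete distribution. The following sketch uses a hypothetical helper, `standardized_moments`, and two illustrative distributions that do not come from the text: a symmetric three-point distribution and a Bernoulli variable with $p = 0.25$.

```python
def standardized_moments(values, probs):
    """Return (skewness, kurtosis) of a discrete distribution."""
    mean = sum(x * p for x, p in zip(values, probs))
    sd = sum((x - mean) ** 2 * p for x, p in zip(values, probs)) ** 0.5
    skewness = sum(((x - mean) / sd) ** 3 * p for x, p in zip(values, probs))
    kurtosis = sum(((x - mean) / sd) ** 4 * p for x, p in zip(values, probs))
    return skewness, kurtosis

# Symmetric three-point distribution (illustrative): the skewness is zero.
sk_sym, ku_sym = standardized_moments([-1, 0, 1], [0.25, 0.5, 0.25])

# Bernoulli variable with p = 0.25 (illustrative): most mass at 0, long right tail.
sk_ber, ku_ber = standardized_moments([0, 1], [0.75, 0.25])

print(sk_sym, ku_sym)
print(sk_ber, ku_ber)
```

The Bernoulli case yields a positive skewness because the less likely outcome lies to the right of the mean.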



The former gauges how asymmetric the distribution is relative to its mean value, whereas
the latter measures how thick the left and right tails are. For instance, it is very well
documented that stock returns display negative skewness and very high kurtosis, reflecting
the fact that extreme negative returns are more frequent than extreme positive returns. In
contrast, changes in exchange rates are typically symmetric around zero, though they also
exhibit very high kurtosis, implying thick tails.

4.7 Discrete distributions

In this section, we briefly review the discrete distributions we have seen in the previous sec- 
tions and introduce a couple of other distribution functions that are often useful in practice. 

4.7.1 Binomial 

Consider a sequence of $n$ independent experiments in which the event $A$ may occur with
probability $p = \Pr(A)$. The resulting sample space is $S = \{\text{all sequences } a_1, \ldots, a_n\}$, where
$a_i$ is either $A$ or $\bar{A}$ for $i = 1, \ldots, n$. The random variable $X$ that counts the number of
times that the event $A$ occurs has a binomial distribution function $B(n,p)$ with parameters
$n$ (namely, the number of independent trials) and $p$ (namely, the probability of event $A$).
The binomial distribution is such that

$$\Pr(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x = 0, 1, \ldots, n. \tag{4.4}$$

The expected value of a binomial distribution is $np$, whereas the variance is $np(1-p)$. Figure
4.2 depicts the probability distribution function and cumulative distribution function of a
binomial random variable with $n = 10$ and $p = 0.25$.

Figure 4.2: The left and right axes correspond to the probability distribution function and
cumulative probability distribution function of a binomial random variable with $n = 10$ and
$p = 0.25$, respectively.
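The numbers behind a plot such as Figure 4.2 can be tabulated directly from (4.4). The sketch below is only illustrative; `binomial_pmf` is a hypothetical helper name, and the mean and variance are recomputed from the probabilities to confirm $np$ and $np(1-p)$.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Pr(X = x) for X ~ B(n, p), as in equation (4.4)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.25
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
cdf = [sum(pmf[: x + 1]) for x in range(n + 1)]   # cumulative probabilities

mean = sum(x * q for x, q in enumerate(pmf))
variance = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))

for x in range(n + 1):
    print(x, round(pmf[x], 4), round(cdf[x], 4))
print(mean, variance)   # np = 2.5 and np(1 - p) = 1.875, up to rounding error
```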



There are two common violations of the binomial law. The first comes in the form of
dependent Bernoulli trials, whereas the second stems from Bernoulli trials with different
probabilities. An example of the latter is Exercise 1 in Section 4.1.1. As for the former,
dependence in the trials may entail a very strong impact on the probability distribution
function, much more than changing probabilities. Suppose, for instance, the Bernoulli trials
exhibit positive dependence. The sum of their values $X = Y_1 + Y_2 + \ldots + Y_n$ will then have
a variance of
$\mathrm{var}(X) = \sum_{i=1}^{n} \mathrm{var}(Y_i) + 2 \sum_{i=2}^{n} \sum_{j=1}^{i-1} \mathrm{cov}(Y_i, Y_j)$.
The first term is equal to the variance of the binomial distribution $np(1-p)$, which may well be
dominated by the second term given that we are summing up $n(n-1)$ positive covariances. So,
under positive/negative dependence, the binomial distribution would under/overestimate the
true variance of $X$. In contrast, with changing probabilities, a binomial distribution based on
the average probability will always overestimate the true variance of $X$, as the first (very
extreme!) example illustrates.

Examples 

(1) Let $X = Y_1 + \ldots + Y_{20}$, where $Y_1, \ldots, Y_{10}$ are always equal to one and $Y_{11}, \ldots, Y_{20}$ are
always equal to zero. The true variance of $X$ is zero, though we would estimate a probability
of 1/2 under the assumption that $X$ is binomial and hence a variance of $20 \times 1/2 \times 1/2 = 5$.

(2) Let $X = \sum_{i=1}^{50} Y_i + \sum_{i=51}^{100} Y_i$, where the first and second terms form binomial distributions
$B(50, 0.3)$ and $B(50, 0.7)$, respectively. Assuming a binomial distribution with the average
probability of $p = 0.5$ yields an expected value of 50 and a variance of 25. Now, we know
from Exercise 1 in Section 4.1.1 that the true probability distribution function of $X$ is

$$\begin{aligned}
\Pr(X = k) &= \sum_{k_1 = \max(0, k-50)}^{\min(k, 50)} \binom{50}{k_1} 0.3^{k_1}\, 0.7^{50 - k_1} \binom{50}{k - k_1} 0.7^{k - k_1}\, 0.3^{50 - k + k_1} \\
&= \sum_{k_1 = \max(0, k-50)}^{\min(k, 50)} \binom{50}{k_1} \binom{50}{k - k_1} 0.3^{50 - k + 2k_1}\, 0.7^{50 + k - 2k_1},
\end{aligned}$$

implying a variance of $50 \times 0.3 \times 0.7 + 50 \times 0.7 \times 0.3 = 21$.
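The second example can be checked numerically by convolving the two binomial distributions. The sketch below is illustrative only: the helper names are hypothetical, and the convolution follows the sum over $k_1$ in the probability distribution function of $X$ given above.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Pr(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def sum_of_two_binomials_pmf(k, n=50, p1=0.3, p2=0.7):
    """Pr(X = k) when X adds independent B(n, p1) and B(n, p2) variables."""
    lower, upper = max(0, k - n), min(k, n)
    return sum(binomial_pmf(k1, n, p1) * binomial_pmf(k - k1, n, p2)
               for k1 in range(lower, upper + 1))

pmf = [sum_of_two_binomials_pmf(k) for k in range(101)]
true_mean = sum(k * q for k, q in enumerate(pmf))
true_var = sum((k - true_mean) ** 2 * q for k, q in enumerate(pmf))
naive_var = 100 * 0.5 * (1 - 0.5)   # variance under the B(100, 0.5) assumption

print(true_mean, true_var, naive_var)
```

The binomial assumption with the average probability gets the mean right but overstates the variance, in line with the discussion of changing probabilities.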



The binomial distribution has a number of applications in practice. The most fruitful
financial application is the binomial tree model of asset returns for derivatives pricing. The
simplest binomial tree assumes that stock returns are independent over time, taking value
either $A$ or $-A$ with probability 1/2. Running such a model for a large number of periods
yields the same solution for the price of a derivative as the Black-Scholes model. This is
not surprising given that a binomial distribution $B(n,p)$ converges to a normal distribution
$\mathcal{N}\big(np, np(1-p)\big)$ as the number of periods $n$ increases (see Figure 4.3). To make the model
more realistic, the most advanced versions of the binomial tree may include time-varying
probabilities and/or dependence over time, which of course contradict the assumptions of
the binomial distribution.

Figure 4.3: The probability distribution function of a binomial random variable resembles
more and more the symmetric bell shape of a normal distribution as the number of trials
increases. The plots refer to binomial distributions with $p = 0.25$ and $n \in \{10, 15, 20\}$.
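The convergence shown in Figure 4.3 can also be gauged numerically. The sketch below compares the binomial probabilities with the density of a normal distribution that has the same mean and variance. The measure of closeness (largest absolute gap over the support) and the extra case $n = 200$ are illustrative choices, not taken from the text.

```python
from math import comb, exp, pi, sqrt

def binomial_pmf(x, n, p):
    """Pr(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)

p = 0.25
max_gap = {}
for n in (10, 15, 20, 200):
    mean, var = n * p, n * p * (1 - p)
    # Largest absolute difference between the binomial probability and the
    # matching normal density over the support 0, 1, ..., n.
    max_gap[n] = max(abs(binomial_pmf(x, n, p) - normal_pdf(x, mean, var))
                     for x in range(n + 1))
    print(n, max_gap[n])
```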



Before applying the binomial distribution, we must make sure that the assumptions of
constant probability and independence hold. For instance, if we perform a survey by phone
between 16:00 and 20:00, the binomial distribution is probably not a good idea given that
it is very likely that the audience changes with the time of the day, especially before and
after working hours. Similarly, we cannot assume a binomial distribution if the interest lies
in the number of stocks with negative returns in a given week. These are not independent
events in that there are common factors that affect different stocks at the same time (e.g.,
the energy sector depends heavily on the price of oil).

4.7.2 Hypergeometric

The hypergeometric distribution arises in situations in which we draw a sample of $n$ units
from a population of size $N$ consisting of two distinct groups of size $N_1$ and $N - N_1$, with
$n \le \min(N_1, N - N_1)$, and we define $X$ as the number of sampled units that belong to one of
the groups, say the first group. We know from Section 3.2.4 that the probability of any event
$A$ is given by the ratio of the number of possible outcomes in $A$ to the total number of possible
outcomes. Accordingly,

$$\Pr(X = x) = \frac{\binom{N_1}{x} \binom{N - N_1}{n - x}}{\binom{N}{n}}, \qquad \text{with } 0 \le x \le n,$$

given that there are $\binom{N_1}{x}$ ways of choosing $x$ from the $N_1$ units of the first group,
$\binom{N - N_1}{n - x}$ ways of choosing the remaining $n - x$ units from the second group, and
$\binom{N}{n}$ ways of choosing a sample of $n$ units from a population of size $N$. Figure 4.4 displays
the probability distribution function and cumulative distribution function of a hypergeometric
random variable with $N = 50$, $N_1 = 25$, and $n = 10$.

Figure 4.4: The left and right axes respectively correspond to the probability distribution
function and cumulative probability distribution function of a hypergeometric random variable
with $N = 50$, $N_1 = 25$, and $n = 10$.



The hypergeometric distribution has an expected value of $E(X) = nN_1/N$ and variance
of $\mathrm{var}(X) = n \frac{N_1}{N} \left(1 - \frac{N_1}{N}\right) \frac{N - n}{N - 1}$. If the population size is much larger than
the sample size (i.e., $N \gg n$), the hypergeometric converges to a binomial distribution with
probability $p = N_1/N$. Although the binomial approximation entails exactly the same expected
value, the variance differs by a small-sample correction factor $\frac{N - n}{N - 1}$. This happens because
the only difference between the binomial and hypergeometric distributions is that the binomial
samples with replacement, whereas the hypergeometric samples without replacement and hence
the probability changes in a very particular way as we keep sampling the population. See
the quality control problem at the end of Section 3.2.4, for instance.

Example: Consider a population of 87 financial analysts in which 13 are from one of the
largest financial institutions in the world. Suppose we wish to form a committee with 10
financial analysts, but there is a concern that too many could come from this big institution.
A rough estimate based on the binomial distribution for the probability of observing two
financial analysts from the above financial institution is

$$\Pr(X = 2) = \binom{10}{2} \left(\frac{13}{87}\right)^{2} \left(\frac{74}{87}\right)^{8} \approx 0.275,$$

whereas the true probability given by the hypergeometric is

$$\Pr(X = 2) = \frac{\binom{13}{2} \binom{74}{8}}{\binom{87}{10}} = 0.294.$$
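Both probabilities in the committee example can be computed with exact binomial coefficients. The sketch below uses illustrative variable names and takes the binomial probability to be $p = N_1/N = 13/87$, as in the approximation described above.

```python
from math import comb

N, N1 = 87, 13   # analysts in total, analysts from the large institution
n, x = 10, 2     # committee size, number of members from that institution

# Exact probability: sampling without replacement (hypergeometric).
hypergeometric = comb(N1, x) * comb(N - N1, n - x) / comb(N, n)

# Rough estimate: sampling with replacement (binomial with p = N1 / N).
p = N1 / N
binomial = comb(n, x) * p**x * (1 - p) ** (n - x)

print(round(hypergeometric, 3), round(binomial, 3))
```

The gap between the two numbers is the price of ignoring that each selected analyst changes the composition of the remaining pool.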

4.7.3 Geometric 

The setup of the geometric distribution is similar to that of the binomial. We conduct
independent trials with probability $p$ of success. In contrast to the binomial, the interest
lies in how many trials are necessary to observe the next success. This means that we fix
the number of successes at one and instead let the number of trials vary randomly. Letting
$X$ denote a geometric random variable yields

$$\Pr(X = x) = p(1 - p)^{x - 1}, \qquad \text{with } x = 1, 2, 3, \ldots$$

The name of the distribution comes from the fact that the cumulative probability distribution
function depends on the sum of a geometric progression. Figure 4.5 plots the probability
distribution function and cumulative distribution function of a geometric random variable
with $p = 0.25$.

Summing all possible outcomes yields

$$\sum_{x=1}^{\infty} \Pr(X = x) = \sum_{x=1}^{\infty} p(1 - p)^{x - 1} = \frac{p}{1 - (1 - p)} = 1.$$

Along the same lines, it is possible to show that the expected value and variance of the
geometric distribution are respectively $1/p$ and $(1 - p)/p^2$. In addition, Exercise 2 of Section
4.1.1 shows that a geometric random variable has no memory and hence the link with the
exponential distribution in (4.8).

Figure 4.5: The left and right axes correspond to the probability distribution function and
cumulative probability distribution function of a geometric random variable with $p = 0.25$,
respectively.
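The properties of the geometric distribution listed above lend themselves to a quick numerical check. The sketch below truncates the infinite sums at 1000 trials, an arbitrary cut-off beyond which the remaining probability is negligible for $p = 0.25$. The values of `s` and `t` in the memoryless check are illustrative.

```python
p = 0.25

def geometric_pmf(x, p):
    """Pr(X = x): the first success occurs at trial x."""
    return p * (1 - p) ** (x - 1)

def geometric_survival(x, p):
    """Pr(X > x): the first x trials are all failures."""
    return (1 - p) ** x

support = range(1, 1001)   # truncated support; the neglected tail is negligible
total = sum(geometric_pmf(x, p) for x in support)
mean = sum(x * geometric_pmf(x, p) for x in support)
variance = sum((x - mean) ** 2 * geometric_pmf(x, p) for x in support)
print(total, mean, variance)   # close to 1, 1/p and (1 - p)/p^2

# No memory: Pr(X > s + t | X > s) coincides with Pr(X > t).
s, t = 3, 4
conditional = geometric_survival(s + t, p) / geometric_survival(s, p)
print(conditional, geometric_survival(t, p))
```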



4.7.4 Negative binomial

The last variation of the binomial distribution is the negative binomial. It has this name for
it inverts the problem of the binomial distribution in that $X$ denotes the number of Bernoulli
trials that are necessary to observe $k$ successes. If you wish, the negative binomial extends
the geometric distribution in that we wait for $k$ rather than only one success to occur. The
probability distribution function of the negative binomial is

$$\Pr(X = x) = \binom{x - 1}{k - 1} p^k (1 - p)^{x - k}, \qquad \text{with } x \ge k.$$

Note that we employ the combination $\binom{x - 1}{k - 1}$ because we know that the last trial must
result in a success. A negative binomial variate has an expected value of $k/p$, with variance
of $k(1 - p)/p^2$. Figure 4.6 displays the probability distribution function and cumulative
distribution function of a negative binomial random variable with $p = 0.25$ and $k = 3$.

Figure 4.6: The left and right axes correspond to the probability distribution function and
cumulative probability distribution function of a negative binomial random variable with
$p = 0.25$ and $k = 3$, respectively.
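As a sanity check on the moments quoted above, the sketch below evaluates the standard negative binomial probability function for the number of trials needed to reach $k$ successes, using the parameters of Figure 4.6. The helper name and the truncation of the support at 1000 trials are illustrative assumptions.

```python
from math import comb

def negative_binomial_pmf(x, k, p):
    """Pr(X = x): the k-th success occurs exactly at trial x (x >= k)."""
    # The last trial is a success; the other k - 1 successes fall
    # anywhere among the first x - 1 trials.
    return comb(x - 1, k - 1) * p**k * (1 - p) ** (x - k)

k, p = 3, 0.25
support = range(k, 1001)   # truncated support; the neglected tail is negligible
total = sum(negative_binomial_pmf(x, k, p) for x in support)
mean = sum(x * negative_binomial_pmf(x, k, p) for x in support)
variance = sum((x - mean) ** 2 * negative_binomial_pmf(x, k, p) for x in support)

print(total, mean, variance)   # close to 1, k/p and k(1 - p)/p^2
```

Setting `k = 1` reduces the function to the geometric probabilities.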



4.7.5 Poisson 



The Poisson distribution provides the simplest way to model events that occur at random over
time. It assumes that there is a constant arrival rate within a time interval and that events
are independent over time. There are a handful of applications for the Poisson distribution
in practice. Electricity providers could well employ a Poisson distribution to model the
occurrence of electrical storms in the areas they serve (though storms could exhibit spatial
dependence). Call centers typically assume a Poisson distribution for the number of phone
calls they receive within a given interval of time (though there could exist an underlying event
triggering many calls at approximately the same time, see example below). Commercial
banks normally employ a Poisson distribution to model how many clients will default on
their loan payments within a given month (though it is hard to argue for independence in
periods of financial distress).

Let $X$ denote the number of events that occur within a given time interval. We say that
$X$ has a Poisson distribution with arrival rate $\lambda$ if its probability distribution function is
given by

$$\Pr(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \qquad \text{with } x = 0, 1, 2, \ldots$$

The arrival rate $\lambda$ is relative to the time interval of reference. For instance, if we are
dealing with an arrival rate $\lambda$ per minute, then we expect $5\lambda$ events within a 5-minute time
interval. It turns out that both the expected value and variance of a Poisson are equal to
$\lambda$. Figure 4.7 portrays the probability distribution function and cumulative distribution
function of a Poisson random variable with an arrival rate of $\lambda = 5$.

Example: To make the emergency call center more efficient, we must model the number of 
incoming telephone calls to the emergency number so as to better understand the likelihood 
of events such as more than 10 phone calls within a 5-minute time interval. We would 
presumably expect that the number of incoming calls between 18:00 and 19:00 is larger than 
the number of phone calls between 04:00 and 05:00. This implies that the arrival rate of 
emergency calls is different depending on the time of the day and hence we have to apply 
different Poisson distributions for different time periods. 
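For a given time period, the probability of an event such as more than 10 calls within 5 minutes follows directly from the Poisson probability function. The sketch below is purely illustrative: the arrival rate of one call per minute is a hypothetical figure, not one taken from the example, and `poisson_pmf` is an illustrative helper name.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Pr(X = x) for a Poisson variable with arrival rate lam."""
    return exp(-lam) * lam**x / factorial(x)

rate_per_minute = 1.0        # hypothetical arrival rate (calls per minute)
lam = 5 * rate_per_minute    # expected number of calls within 5 minutes

# Pr(X > 10) = 1 - Pr(X <= 10)
prob_more_than_10 = 1 - sum(poisson_pmf(x, lam) for x in range(11))
print(prob_more_than_10)

# Mean and variance recomputed from the probabilities (support truncated at 100).
mean = sum(x * poisson_pmf(x, lam) for x in range(101))
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in range(101))
print(mean, variance)        # both close to lam
```

Repeating the calculation with a different `rate_per_minute` for each period of the day mirrors the idea of applying different Poisson distributions to different time periods.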

Although the Poisson distribution seems very different from the binomial distribution,
there is a close link between them. To see why, let's think about how we would model the type
of situation that calls for a Poisson distribution by means of a binomial distribution. The
first step is to split the time interval of reference into $n$ very short subintervals of equal
length. The idea is to have small enough subintervals so as to ensure that the probability
of observing more than one event within a subinterval is negligible in comparison with the
probability of at most one occurrence. In other words, there is either 0 or 1 event in each
subinterval. This paves the way to the use of a binomial distribution given that we can now
model this sequence of subintervals as a sequence of Bernoulli trials.

It remains to decide upon the probability $p$ of observing the event, which should somehow
relate to the arrival rate $\lambda$ of the Poisson distribution. We know that, on average, there are
$\lambda \Delta t$ events within a time interval of length $\Delta t$. Splitting $\Delta t$ into $n$ subintervals of time
yields on average $np$ events within $\Delta t$. It now suffices to equate the expected number of
events under the binomial assumption with the arrival rate of the Poisson distribution to
obtain $p = \lambda \frac{\Delta t}{n}$. Next, we consider a random variable $X$ that counts the number of events
within a time interval of length, say, $\Delta t = 1$.




Figure 4.7: The left and right axes correspond to the probability distribution function and
cumulative probability distribution function of a Poisson with an arrival rate of $\lambda = 5$,
respectively.

We first compute from the binomial distribution the probability of not observing any
event: $\Pr(X = 0) = (1 - p)^n = \left(1 - \frac{\lambda}{n}\right)^n$. However, we may think of imposing $n \to \infty$ so as
to ensure that the subintervals are small enough. This leads to $\Pr(X = 0) = e^{-\lambda}$. We next
compute a recursive relation for the probability distribution function under the binomial
assumption, namely,

$$\begin{aligned}
\Pr(X = k) &= \binom{n}{k} p^k (1 - p)^{n - k} \\
&= \frac{n!}{k!\,(n - k)!}\, p^{k - 1} (1 - p)^{n - k + 1}\, \frac{p}{1 - p} \\
&= \binom{n}{k - 1} p^{k - 1} (1 - p)^{n - k + 1}\, \frac{n - k + 1}{k}\, \frac{p}{1 - p} \\
&= \frac{(n - k + 1)\,p}{k(1 - p)}\, \Pr(X = k - 1) \\
&= \frac{\lambda - (k - 1)\,p}{k(1 - p)}\, \Pr(X = k - 1),
\end{aligned}$$

given that $p = \lambda/n$. Taking limits ($n \to \infty$ or, equivalently, $p \to 0$) then yields
$\Pr(X = k) = \frac{\lambda}{k} \Pr(X = k - 1)$. In particular, $\Pr(X = 1) = \lambda e^{-\lambda}$ for $k = 1$,
$\Pr(X = 2) = \frac{\lambda^2}{2} e^{-\lambda}$ for $k = 2$, and so on. In general, $\Pr(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}$, just as in
the Poisson distribution.

The above discussion about the intimate link between the binomial and Poisson distributions
is interesting because it motivates why so many credit institutions employ a Poisson
distribution to model the number of credit defaults. Although the binomial distribution
is theoretically more suitable, it is difficult to handle the probability distribution function of
a binomial random variable if $n$ is too large. As the probability of a credit default is typically
very small (at least in a developed economy and in periods of normal activity), this is the
ideal setup for a Poisson approximation of the binomial distribution.
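The quality of the Poisson approximation is easy to inspect numerically. The sketch below compares the probabilities of a binomial $B(n, \lambda/n)$ with those of a Poisson with arrival rate $\lambda$ as $n$ grows, and also rebuilds the Poisson probabilities from the limiting recursion. The value $\lambda = 5$, the values of $n$ and the helper names are illustrative choices.

```python
from math import comb, exp, factorial

lam = 5.0   # illustrative arrival rate over an interval of unit length

def binomial_pmf(k, n, p):
    """Pr(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Pr(X = k) for a Poisson variable with arrival rate lam."""
    return exp(-lam) * lam**k / factorial(k)

# Limiting recursion: Pr(X = k) = (lam / k) Pr(X = k - 1), with Pr(X = 0) = exp(-lam).
recursive = [exp(-lam)]
for k in range(1, 11):
    recursive.append(recursive[-1] * lam / k)

# Largest gap between the B(n, lam / n) and Poisson probabilities for k = 0, ..., 10.
gaps = {}
for n in (10, 100, 10_000):
    gaps[n] = max(abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
                  for k in range(11))
    print(n, gaps[n])
```

The gap shrinks as the subintervals become shorter, which is the numerical counterpart of letting $n$ grow without bound.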

4.8 Continuous distributions 

In this section, we briefly review the few continuous distributions that we have seen in the 
previous sections. In addition, we introduce a series of other continuous distributions, most 
of them deriving from the normal distribution. Also known as the Gaussian distribution, 
the latter is the most important distribution in statistics not only because it is often a good 
assumption in practice, but also because it naturally arises in theory to approximate the 
distribution of the sample mean of any random variable (regardless of its distribution). This 
last result is known as the central limit theorem, which we will study in Section 5.2. 

4.8.1 Uniform 

This is the continuous counterpart of equiprobable events in that any interval of a given
length within the support of the distribution will have exactly the same probability. More
formally, let $X$ denote a uniform random variable in the interval $[\alpha, \beta]$, with density function

$$f_X(x) = \begin{cases} \frac{1}{\beta - \alpha}, & \text{if } \alpha \le x \le \beta \\ 0, & \text{otherwise.} \end{cases}$$

We have already shown that the expected value of a uniform random variable is given by
the average between the upper and lower limits of the support, i.e., $E(X) = \frac{\alpha + \beta}{2}$, whereas
the variance is $\mathrm{var}(X) = \frac{(\beta - \alpha)^2}{12}$.

Example: Suppose that the delay of a given tram is uniformly distributed between 0 and
20 minutes during winter. This means that the probability of observing a delay of at least
8 minutes is

$$\Pr(X \ge 8) = 1 - \Pr(X < 8) = 1 - F_X(8) = \int_8^{20} \frac{1}{20}\, dx = 1 - \int_0^8 \frac{1}{20}\, dx = 1 - \frac{8}{20} = \frac{12}{20} = \frac{3}{5}.$$



4.8.2 Exponential 



The exponential distribution arises in a setting very similar to the one of the Poisson
distribution. The difference is that the interest now lies in the length of time we must wait to
observe the next event. As in the Poisson context, we assume that events are independent over time
and the arrival rate is constant. Applications abound in quality control, including fatigue 
and reliability analysis. In finance, there is a new strand of the literature that aims to model 
the time between trades so as to better understand market activity. In addition, it is also 
interesting to observe how much time it takes to observe a change in prices for it conveys 
information about market volatility. Finally, labor economists are keen on carrying out du- 
ration analyses so as to study unemployment spells and time to promotions. In general, the 
exponential distribution plays a major role in duration analysis regardless of whether the 
duration has an economic, financial or quality-control interpretation. 

The density function of an exponential variate is $f_X(x) = \frac{1}{\lambda} e^{-x/\lambda}$ with $x \ge 0$ (zero
otherwise), whereas the survival function $S_X(x) = 1 - F_X(x) = \Pr(X > x)$ is given by

$$\Pr(X > x) = \int_x^{\infty} f_X(t)\, dt = 1 - \int_0^x \frac{1}{\lambda} e^{-t/\lambda}\, dt = e^{-x/\lambda}.$$

This naturally means that the cumulative distribution function is $F_X(x) = 1 - e^{-x/\lambda}$, though
it is more common in duration analysis to talk about the survival function. Both the expected
value and the standard deviation of the exponential distribution are equal to $\lambda$, which reminds
us again of the Poisson distribution whose expected value is equal to the variance. Figure
4.8 portrays the probability density and distribution functions of a standard exponential
random variable (i.e., $\lambda = 1$).

As aforementioned, the exponential distribution somewhat resembles the geometric
distribution in that it features no memory given that

$$\Pr(X > s + t \mid X > s) = \frac{\Pr(X > s + t)}{\Pr(X > s)} = \frac{e^{-(s + t)/\lambda}}{e^{-s/\lambda}} = e^{-t/\lambda} = \Pr(X > t).$$

The probability of waiting another interval of time of at least $t$ is constant regardless of how
long we have already been waiting.
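The lack of memory can be verified numerically from the survival function, and a simulation confirms that the mean and the standard deviation coincide. In the sketch below, the values of `lam`, `s` and `t`, the seed and the sample size are hypothetical. Note that `random.expovariate` is parameterized by the rate, that is, by the reciprocal of the mean used in the density above.

```python
import random
from math import exp

lam = 2.0          # hypothetical mean waiting time
s, t = 1.5, 3.0    # hypothetical elapsed and additional waiting times

def survival(x, lam):
    """Pr(X > x) for an exponential variate with mean lam."""
    return exp(-x / lam)

conditional = survival(s + t, lam) / survival(s, lam)   # Pr(X > s + t | X > s)
unconditional = survival(t, lam)                        # Pr(X > t)
print(conditional, unconditional)

# Simulation check; random.expovariate takes the rate 1 / lam as its argument.
random.seed(3)     # arbitrary seed, only for reproducibility
sample = [random.expovariate(1 / lam) for _ in range(100_000)]
sample_mean = sum(sample) / len(sample)
sample_sd = (sum((v - sample_mean) ** 2 for v in sample) / len(sample)) ** 0.5
print(sample_mean, sample_sd)   # both should be close to lam
```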




Figure 4.8: The left and right axes respectively correspond to the probability density and
distribution functions of a standard exponential random variable, that is to say, an exponential
with $\lambda = 1$.

4.8.3 Normal and related distributions

The normal (or Gaussian) is the fundamental distribution in statistics not only because it
naturally appears in a wide array of situations, but also because of the central limit theorems
that ensure it provides a good approximation in large samples for the sample mean of almost
any random variable. The normal distribution is also easy to manipulate for a number of
reasons. First, it is completely characterized by its mean and variance, which we denote by
$\mu$ and $\sigma^2$, respectively. Second, in contrast to the distributions we have seen so far, there is
no connection between the mean and variance of a Gaussian variate. This confers an extra
level of flexibility to the normal distribution, explaining why it seems to work well in all sorts
of situations. Third, the normal distribution is closed under affine transformations in that a
linear combination of (jointly) Gaussian random variables is also Gaussian.

The density function of a normal random variable is given by

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^{2}\right], \qquad \text{with } -\infty < x < \infty.$$

We denote a normal random variable with mean $\mu$ and variance $\sigma^2$ by $X \sim \mathcal{N}(\mu, \sigma^2)$.
There is no closed-form solution for the cumulative distribution function of a normal random
variable and hence we must tabulate it. Naturally, it would be impossible to tabulate the
distribution function of $X \sim \mathcal{N}(\mu, \sigma^2)$ for every value in the real line and for every
mean-variance combination.




Figure 4.9: The left and right axes correspond to the probability density and distribution 
functions of a standard normal random variable, respectively. 



To circumvent this problem, we tabulate only the standard normal distribution $\mathcal{N}(0, 1)$,
which has zero mean and unit variance, given that we can obtain any other normal distribution
by means of a simple affine transformation: $Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$ if $X \sim \mathcal{N}(\mu, \sigma^2)$. The
standard normal distribution and density play such a major role in statistics that we denote
them by $\Phi(\cdot)$ and $\phi(\cdot)$, respectively. The latter is $\phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z^2\right)$ for $z \in \mathbb{R}$. Figure
4.9 displays the probability density and distribution functions of a standard normal random
variable.
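In software, the role of the table is played by a numerical routine for $\Phi(\cdot)$. The sketch below relies on the identity $\Phi(z) = \frac{1}{2}\left[1 + \mathrm{erf}(z/\sqrt{2})\right]$, which is a standard result rather than something derived in the text, and then standardizes as described above. The function names and the example values ($\mu = 10$, $\sigma = 2$, $x = 13$) are illustrative.

```python
from math import erf, sqrt

def standard_normal_cdf(z):
    """Phi(z), written in terms of the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_cdf(x, mu, sigma):
    """Pr(X <= x) for X ~ N(mu, sigma^2), obtained by standardizing X."""
    return standard_normal_cdf((x - mu) / sigma)

# Illustrative values: X ~ N(10, 4), so that (13 - 10) / 2 = 1.5.
print(normal_cdf(13, 10, 2), standard_normal_cdf(1.5))

# Probability that a standard normal lies within 1.96 of zero.
print(standard_normal_cdf(1.96) - standard_normal_cdf(-1.96))
```

Only one function, `standard_normal_cdf`, needs to be evaluated numerically; every other normal distribution follows from the affine transformation.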

The normal distribution is symmetric and hence all odd centered moments are equal
to zero. In contrast, the variance of the normal distribution determines the magnitude
of every even centered moment. For instance, the kurtosis of the normal distribution is
$k(X) = E\left[\left(\frac{X - \mu}{\sigma}\right)^{4}\right] = E(Z^4) = 3$ and that's why some people refer to $k(X) - 3$ as excess
kurtosis. Figure 4.10 shows precisely how increasing the dispersion affects the shape of a
normal density function and hence the probability of observing extreme realizations.




Figure 4.10: The probability density function of a normal random variable with zero mean
and standard deviation $\sigma \in \{1, 2, 3\}$.

Figure 4.11: The first panel plots the probability density and distribution functions of a
chi-square random variable in the left and right axes, respectively. The second panel shows
how the shape of the chi-square density changes with the degrees of freedom.



The normal distribution gives way to a number of interesting distributions depending
on how we transform it. In what follows, we discuss two distributions that derive from the
Gaussian distribution and are of particular interest in the context of statistical inference.
The first is the chi-square distribution, which consists of the sum of a number of squared
independent standard normal random variables. Let $Z_i \sim \mathcal{N}(0, 1)$ denote a sequence of
mutually independent standard normal random variables for $i = 1, \ldots, N$. It then follows that
$\chi^2_N = \sum_{i=1}^{N} Z_i^2$ has a chi-square distribution with $N$ degrees of freedom. The mean of $\chi^2_N$ is
$N$, whereas its variance is twice as large, amounting to $2N$. The chi-square is important in a
number of situations. For instance, the chi-square distribution arises in a very natural manner if
we wish to compute the probability that a standard normal random variable belongs to
a symmetric interval around zero, given that $\Pr(|Z| \le c) = \Pr(Z^2 \le c^2)$ and $Z^2$ is a chi-square
with one degree of freedom. Figure 4.11 not only plots the probability density and
distribution functions of a chi-square random variable, but also shows how it changes with
the degrees of freedom.

Figure 4.12: The first panel plots the probability density and distribution functions of a
t-student random variable in the left and right axes, respectively. The second panel shows
how the shape of t-student density varies with the degrees of freedom.

The second is the t-student distribution, which stems from a ratio of a standard normal
distribution to the square root of an independent chi-square distribution divided by its
degrees of freedom. In particular, we denote a t-student with $N$ degrees of freedom by

$$t_N = \frac{Z_0}{\sqrt{\frac{1}{N} \sum_{i=1}^{N} Z_i^2}},$$

where the $Z_i$'s are independent standard normal random variables for $i = 0, 1, \ldots, N$.
The t-student is symmetric around the origin and hence has mean zero. In turn, the variance
of a t-student random variable is $N/(N - 2)$, with $N$ denoting the degrees of freedom. It is
easy to see that this expression breaks down if there are not enough degrees of freedom: it is
undefined for $N = 2$ and would imply a negative variance for $N < 2$. This is true in general
for the t-student distribution in that the $k$th moment exists if and only if the degrees of
freedom exceed $k$. We will see later that the t-student distribution is paramount to hypothesis
testing under the assumption of normality.
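Both constructions can be mimicked by simulation, building the chi-square and t-student variates from independent standard normal draws exactly as in their definitions. The sketch below is only illustrative: the degrees of freedom ($N = 10$), the number of draws and the seed are arbitrary choices.

```python
import random

random.seed(2)    # arbitrary seed, only for reproducibility
N = 10            # degrees of freedom (illustrative)
draws = 100_000

chi2_sample, t_sample = [], []
for _ in range(draws):
    z = [random.gauss(0, 1) for _ in range(N + 1)]   # Z_0, Z_1, ..., Z_N
    q = sum(v * v for v in z[1:])                    # sum of N squared normals
    chi2_sample.append(q)
    t_sample.append(z[0] / (q / N) ** 0.5)           # Z_0 over sqrt(chi-square / N)

def mean_and_variance(sample):
    m = sum(sample) / len(sample)
    return m, sum((v - m) ** 2 for v in sample) / len(sample)

chi2_mean, chi2_var = mean_and_variance(chi2_sample)
t_mean, t_var = mean_and_variance(t_sample)
print(chi2_mean, chi2_var)   # close to N and 2N
print(t_mean, t_var)         # close to 0 and N / (N - 2)
```

With very few degrees of freedom the sample variance of the t-student draws becomes erratic, reflecting the fact that the corresponding moments cease to exist.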

Figure 4.12 depicts the probability density and distribution functions of a t-student random
variable with 3 degrees of freedom in the first panel, whereas the second panel illustrates
how the shape of its density changes with the degrees of freedom. In particular, it is noticeable
that the t-student converges to a standard normal distribution as the degrees of freedom
increase.

Chapter 5
Random sampling



The way we collect the data is extremely important because, ideally, we would like to end
up with a random sample. The latter consists of independent and identically distributed
(iid) observations from the population. This is the ideal setup for two reasons. If data are
independent, then it is easy to characterize the joint distribution because it is equivalent to
the product of the marginals. In addition, if the data come from a common distribution,
the marginals are all identical, thereby depending on the same vector $\theta$ of parameters. We
formalize these ideas using the joint density function of a random sample $\boldsymbol{X} = (X_1, \ldots, X_N)$,
namely, $f_{\boldsymbol{X}}(x_1, \ldots, x_N) = \prod_{i=1}^{N} f_{X_i}(x_i) = \prod_{i=1}^{N} f_X(x_i; \theta)$. The first equality follows from
independence, whereas the second ensues from the fact that the elements of the random sample
are identically distributed. We typically represent a random sample by $X_i \sim \text{iid } f_X(\cdot\,; \theta)$ for
$i = 1, \ldots, N$.
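To make the factorization concrete, the sketch below evaluates the joint density of a random sample as the product of identical marginal densities. The choice of an exponential marginal with mean `theta`, the sample values and the function names are illustrative assumptions. Any other common density would do.

```python
from math import exp, prod

def exponential_density(x, theta):
    """Marginal density f_X(x; theta): exponential with mean theta."""
    return exp(-x / theta) / theta if x >= 0 else 0.0

def joint_density(sample, theta):
    """Joint density of an iid sample: the product of identical marginals."""
    return prod(exponential_density(x, theta) for x in sample)

sample = [0.8, 2.5, 1.1]   # hypothetical observations
theta = 2.0                # hypothetical parameter value
print(joint_density(sample, theta))
```

Independence is what licenses the product, whereas the identical distribution is what allows the same function `exponential_density` and the same `theta` to be used for every observation.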

Data collection is not easy. It is indeed quite difficult to design a data sampling procedure
that is free of any bias. The most common problems in business and economics relate
to censorship, selection, survivorship, and no-answer biases. Censorship bias takes place
whenever we cannot observe data within a given interval. For instance, if there are price
limits in a stock exchange, we cannot observe stock prices either above the upper limit or
below the lower limit given by the maximum daily oscillation. In such a context, censoring
may occur for, even if the equilibrium price moves above or below the limits, we observe at
most a price at one of the limits. A similar problem arises in exchange-rate target zones.
The only difference is that a successful speculative attack against the currency may cause
the break of the target zone (i.e., price limits). A very different situation in which censorship
plays a major role is in labor economics. For instance, if we are measuring how much time it
takes for an individual to obtain a promotion, our data set will invariably include individuals
who haven't been promoted yet. So, the most we can say is that the time to promotion of
these individuals is larger than the number of periods we have been observing them in our
sample.

Selection bias occurs in the event that the sampling procedure is such that the data 
tend to come from a specific group within the population. For instance, in most developing 
countries, women must decide whether they join the work force or stay at home full time 
taking care of the house chores, whereas men rarely have such an option. This means that we 
cannot directly compare the salary of male and female workers. They are different not only 
in gender, but also because female workers have taken a previous decision to join the labor 
market, whereas men didn't. Taking such a decision shows some degree of ambition and 
determination that is probably correlated with productivity and hence with salary. That is 
why a simple comparison of wage differentials will normally underestimate the discrimination 
against women in the labor market. The same reasoning applies to immigrants, as well. The 
simple fact that they have taken a previous decision to migrate, while others in the same 
situation didn't, indicates that they form a different group (perhaps more ambitious, focused, 
and determined). In finance, selection bias may also affect asset returns through a liquidity 
(rather than competition) channel. Illiquid assets by definition trade less frequently than 
liquid assets and hence they are much more likely to exhibit price staleness. If transaction 
prices do not change, then the return is zero. However, zero returns do not really reflect 
the true change in the value of the asset in that they are merely an artifact due to illiquidity. 
This means that we cannot treat zero returns and nonzero returns in the same way. 

Survivorship bias arises whenever we are looking at a sample of individuals/firms/units 
resulting from some sort of competition. Although it is much more natural to think about 
survivorship bias in biology, where evolution establishes intense competition among different 
genes, examples abound in economics, finance, and management. For instance, hedge funds 
that perform poorly end up managing fewer funds than those doing well. As most indices 
are weighted by assets under management, they tend to reflect more the performance of 
the hedge funds that perform well over time, overestimating the overall performance in the 
industry. In fact, hedge funds that perform systematically poorly end up closing their doors, 
thereby disappearing from databases at a certain point in time. Survivorship bias then arises 
very strongly if we collect data by sampling the returns of all funds that currently exist since 
some date in the past. To avoid such a bias, we must first choose the starting date and then 
collect the data of all funds that were operational since then. In this way, the data set would 
include not only the successful funds that are still in action, but also those that did not do very 
well and ceased to exist. 

The no-answer bias is very common in surveys. People who have strong opinions, es- 
pecially negative, are typically much more inclined to answer a survey. That's why most 
lecturers are very keen to publicize teaching evaluation surveys to students. If they don't, 
it is very likely that the answers will have a negative bias, for students who do not have 
many criticisms will presumably not bother to answer the survey as much as the students 
who are not happy. In work environments, we could well argue that workaholics tend not to 
respond to work-unrelated surveys (e.g., menu of the eatery) for they prefer to dedicate their 
time to more productive tasks (it is also very likely that they bring their own sandwich from 
home to spend less time in lunch breaks!). This means that work-unrelated surveys will not 
entirely represent the views of the population in the firm due to the lack of answers from 
the workaholics. 

Example: Suppose a commercial bank would like to work out how many days on average 
it takes to process a cheque. The cost of sampling all cheques is prohibitively high and hence 
the person in charge decides to draw a sample of 1,000 cheques. The snag is how to draw a 
random sample given that we cannot simply tag all cheques with a number and then draw 
from a discrete uniform distribution. One solution is to draw every mth cheque that the 
bank processes until we observe 1,000 cheques. This type of sampling is not entirely random 
in that we will never observe two consecutive cheques (from the same firm, perhaps!) in the 
sample, but it should do the trick reasonably well. 
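
The every-mth-cheque rule can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: the function name `systematic_sample`, the stream of 50,000 cheque identifiers, and the step m = 50 are hypothetical and do not come from the example.

```python
import random

def systematic_sample(items, sample_size, step, seed=None):
    """Pick every `step`-th item after a random start, up to `sample_size` items."""
    rng = random.Random(seed)
    start = rng.randrange(step)      # random offset within the first block of `step` items
    picked = items[start::step]      # every step-th item from that offset onwards
    return picked[:sample_size]

# Hypothetical stream of cheque identifiers, listed in processing order.
cheques = list(range(50_000))
sample = systematic_sample(cheques, sample_size=1_000, step=50, seed=1)
print(len(sample), sample[:3])
```

Only the starting point is random here, which is why two consecutive cheques can never appear together in the sample.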

The above example illustrates well the fact that random sampling is an abstract notion. 
In some situations, it is virtually impossible to draw a completely random sample, and hence 
we must make do with samples that are "random enough". The next series of examples are more 
concrete in that they establish alternative procedures to draw a random sample in all sorts 
of setups. 




Examples 

(1) Suppose a marketing firm wishes to interview 1,000 households in a town. One solution 
is to draw from a discrete uniform distribution random numbers that identify households 
given their addresses (or post codes). The interviewer then visits the address between 15:00 
and 18:00 and, if no one answers, we eliminate that address and replace it by another 
drawn at random from the same discrete uniform distribution. 

(2) Suppose a bookstore in the university campus wishes to evaluate the stock of textbooks 
before the beginning of the term. We may draw a random number that identifies a given 
location in the shelves and then check the textbook in that location as well as the 50 closest 
textbooks. 

(3) A manager would like to assess the performance of the cleaning staff. Her assistant 
comes up with two assessment strategies. The first involves inspecting 15 offices completely 
at random, whereas the second strategy chooses one office at random from each of the three 
floors of the building and then inspects them as well as the 9 offices closest to each of them. 
The time to completion is the same for both strategies despite the fact that the second strategy 
inspects twice as many offices. The manager decides for the first strategy for it really entails a 
random sample. She rightly explains to her assistant that, even though the second strategy 
seems more efficient at first glance, it is less convenient for it would not generate a random 
sample. The reason is simple. The allocation of the cleaning staff is typically by wing and 
floor, and hence, by restricting attention to a given area of the floor (i.e., the neighborhood 
of the sampled office), she would risk assessing the work of a particular subset of janitors 
rather than the overall performance of the cleaning staff. 

5.1 Sample statistics 

Let $X = (X_1, \ldots, X_N)$ denote a random sample with density function $f_X(\cdot\,; \theta)$. We don't 
know the true value of the vector $\theta$ of population parameters and so the goal is to infer 
it from the realization of the random sample, say, $x = (x_1, \ldots, x_N)$. Note that $x$ is just 
one possible realization for the random sample $X$ with probability mass given by the joint 
density function evaluated at $x$, namely, $f_X(x; \theta) = \prod_{i=1}^N f_X(x_i; \theta)$. 

Example: Let the random variable X come from a discrete uniform distribution that takes 
integer values between 1 and 6. This means that the sample space is {1,2,3,4,5,6}, whose 
elements are drawn with probability 1/6. Suppose now that we take three random samples 
of three observations, say (1, 1, 4), (2, 4, 5), and (2, 3, 2). Their sample means are respectively 
2, 11/3, and 7/3, though the expected value of $X$ is (1 + 2 + 3 + 4 + 5 + 6)/6 = 7/2. 

What we wish to illustrate with the above example is that the sample mean is also a 
random variable, whose distribution depends on the distribution from which we draw the 
random sample. Accordingly, the value we observe for the sample mean varies with the 
sample we actually draw. Each different sample yields a distinct value for the sample mean. 
Needless to say, this holds for any function $g(X) = g(X_1, \ldots, X_N)$ of the sample, which 
we call sample statistic. To conduct inference, we must always bear in mind that a sample 
statistic is random and hence we must determine its distribution, which we call sampling 
distribution. 
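
To see the sample mean behaving as a random variable, the sketch below recomputes the three sample means of the die example exactly and then approximates the sampling distribution by simulation. The helper name `sample_mean`, the seed, and the number of replications are our own choices for illustration.

```python
from fractions import Fraction
import random

def sample_mean(sample):
    """Exact sample mean, kept as a fraction."""
    return Fraction(sum(sample), len(sample))

# The three samples of the example and the population mean of a fair die.
samples = [(1, 1, 4), (2, 4, 5), (2, 3, 2)]
means = [sample_mean(s) for s in samples]
population_mean = Fraction(sum(range(1, 7)), 6)

# Many samples of three observations: each one yields its own sample mean.
rng = random.Random(0)
simulated = [sample_mean([rng.randint(1, 6) for _ in range(3)]) for _ in range(20_000)]
average_of_means = float(sum(simulated) / len(simulated))
print(means, population_mean, average_of_means)
```

Individual sample means differ from one draw to the next, while their average across many samples settles close to the population mean.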

Example: Let $X = (X_1, \ldots, X_N)$ denote a random sample from a normal distribution 
with mean $\mu$ and variance $\sigma^2$, i.e., $X_i \sim \text{iid } \mathcal{N}(\mu, \sigma^2)$. The sample mean $\bar{X}_N = \frac{1}{N}\sum_{i=1}^N X_i$ is 
a random variable with expected value 

$$E(\bar{X}_N) = E\left(\frac{1}{N}\sum_{i=1}^N X_i\right) = \frac{1}{N}\sum_{i=1}^N E(X_i) = \frac{1}{N}\sum_{i=1}^N \mu = \frac{1}{N}(N\mu) = \mu$$

and variance 

$$\mathrm{var}(\bar{X}_N) = \mathrm{var}\left(\frac{1}{N}\sum_{i=1}^N X_i\right) = \frac{1}{N^2}\sum_{i=1}^N \mathrm{var}(X_i) = \frac{1}{N^2}\sum_{i=1}^N \sigma^2 = \frac{1}{N^2}(N\sigma^2) = \sigma^2/N.$$

Finally, as the sample mean is a linear combination of normal random variables, it is also 
normal. We thus conclude that $\bar{X}_N \sim \mathcal{N}(\mu, \sigma^2/N)$, which differs from the normal distribution 
from which we draw the sample. 

In the next section, we will show that, as long as the sample size is large enough, normality 
is often a very good approximation for most sampling distributions, even if the distribution 
from which we draw the sample is not normal (and even unknown). The next example 
concludes this section by depicting such a situation. 



Example: The number of transactions on the Macau stock market is on average 62,000 
trades per week with a standard deviation of 7,000. Suppose now that we take note of the 
number of transactions per week within a year. The expected value of the sample mean is 
$E(\bar{X}_N) = 62{,}000$ trades per week, with a variance of $\mathrm{var}(\bar{X}_N) = 7{,}000^2/52 = 942{,}307.69$ 
given that there are 52 weeks in a year. The normal approximation for the sample mean 
distribution yields a probability of observing a sample mean at least two standard deviations 
away from its mean, i.e., $\Pr\left(\left|\frac{\bar{X}_N - 62{,}000}{\sqrt{942{,}307.69}}\right| > 2\right)$, of about 5%. 
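
The numbers in this example are easy to reproduce. The snippet below is a small sketch that uses only the figures quoted above; the variable names are ours, and the two-sided normal tail is computed through the complementary error function.

```python
import math

mean_trades = 62_000   # weekly mean quoted in the example
sd_trades = 7_000      # weekly standard deviation quoted in the example
weeks = 52

var_of_mean = sd_trades**2 / weeks    # variance of the sample mean
se_of_mean = math.sqrt(var_of_mean)   # its standard deviation

# Pr(|Z| > 2) for a standard normal Z equals erfc(2 / sqrt(2)).
prob_two_sd = math.erfc(2 / math.sqrt(2))
print(round(var_of_mean, 2), round(se_of_mean, 2), round(prob_two_sd, 4))
```

The tail probability comes out slightly below 5%, which is the "about 5%" of the text.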

5.2 Large-sample theory 

In this section, we will discuss how to approximate the sampling distribution of a statistic in 
general. In particular, we will talk about asymptotic approximations in that we will let the 
sample size grow to infinity. Although these results hold only in the limit, we will see that 
they often provide good guidance on the behavior of most statistics as long as the sample 
size is large enough. What is 'large enough' will of course depend on the task at hand. If we 
employ asymptotic results to approximate the distribution of the mean of a random sample 
coming from a uniform distribution, large enough could mean even 5 observations. 

We first discuss what we mean by limit theory in the context of random variables. In 
particular, we establish several modes of convergence for sequences of random variables. 
Next, we establish some convergence results for sample means given that they play a major 
role in statistics. There are two asymptotic results that permeate almost any problem of 
statistical inference: Laws of large numbers say that sample means converge (in some sense 
that we will make precise in the next section) to the population mean, whereas central limit 
theorems single out the conditions under which we can approximate the distribution of a 
sample mean with a normal distribution. 

5.2.1 Modes of convergence 

Let $X_1, X_2, \ldots$ denote a sequence of random variables, which we denote simply by $X_N$ despite 
the abuse of notation. We say that $X_N$ converges in probability to a constant $a$, which 
we denote by $X_N \xrightarrow{p} a$, if $\lim_{N\to\infty} \Pr(|X_N - a| < \epsilon) = 1$ for any $\epsilon > 0$. We call $a$ the 
probability limit of $X_N$, which we denote by $\mathrm{plim}_{N\to\infty} X_N = a$. The probability limit is a 
natural generalization of the mathematical notion of a limit. A stronger mode of convergence 
follows if we switch the order of the limit and probability operators. Indeed, it is much 
more stringent to impose that $\Pr(\lim_{N\to\infty} X_N = a) = 1$ for it constrains the function that 
the random variables in the sequence $X_N$ use to map the sample space to the real line. 
This is what we call almost sure convergence or, equivalently, convergence with probability 
one, which we denote by $X_N \xrightarrow{a.s.} a$. Mean squared convergence denotes a situation in 
which a sequence $X_N$ of random variables converges in mean square to a constant $a$ in that 
$\lim_{N\to\infty} E(X_N - a)^2 = 0$. We then say that $a$ is the mean square limit of the sequence $X_N$ 
and write $X_N \xrightarrow{m.s.} a$. 

These modes of convergence admit various extensions. For instance, we can consider the 
convergence to a random variable by writing that $X_N \xrightarrow{fmc} X$ if and only if $X_N - X \xrightarrow{fmc} 0$, 
where $fmc \in \{p, a.s., m.s.\}$ denotes your favorite mode of convergence. In addition, we 
can also think of sequences of random vectors (or matrices) rather than sequences of scalar 
random variables. In this event, it suffices to apply your favorite mode of convergence 
element-wise. 

The modes of convergence we have seen so far are pretty strong in that the sequence of 
random variables in the limit becomes degenerate given that it converges to a simple constant 
(without any form of randomness). In contrast, convergence in distribution deals with limiting 
distributions. We say that $X_N$ converges in distribution to $X$ if $\lim_{N\to\infty} F_{X_N}(x) = F_X(x)$ 
for any $x \in \mathbb{R}$ at which $F_X$ is continuous. We call $F_X$ the asymptotic (or limiting) distribution of $X_N$ and denote this 
mode of convergence by $X_N \xrightarrow{d} X$ or $X_N \xrightarrow{d} F_X$. 

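Convergence in distribution is the weakest of the convergence definitions in that we constrain 
only the distribution of the random variable and not the values it takes. In particular, 
it is possible to show that 

$$\left.\begin{array}{l} X_N \xrightarrow{a.s.} X \\ X_N \xrightarrow{m.s.} X \end{array}\right\} \Longrightarrow X_N \xrightarrow{p} X \Longrightarrow X_N \xrightarrow{d} X.$$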

As before, we can also extend the notion of convergence in distribution to a multivariate 
setting (i.e., to random vectors). 

Finally, we conclude this section by showing how to manipulate and combine the different 
modes of convergence. We first note that $g(X_N) \xrightarrow{p} g(X)$ if $X_N \xrightarrow{p} X$ and that $g(X_N) \xrightarrow{d} g(X)$ 
if $X_N \xrightarrow{d} X$ provided that $g(\cdot)$ is a continuous function. This pair of results is known 
as the continuous mapping theorem. The Slutsky theorem is one of the various applications 
of the continuous mapping theorem, ensuring not only that $X_N + Y_N \xrightarrow{d} X + a$ if $X_N \xrightarrow{d} X$ 
and $Y_N \xrightarrow{p} a$, but also that $X_N Y_N \xrightarrow{p} 0$ if $X_N \xrightarrow{d} X$ and $Y_N \xrightarrow{p} 0$. Finally, it is easy to 
see that $A_N X_N \xrightarrow{d} A X$ if $X_N \xrightarrow{d} X$ and $A_N \xrightarrow{p} A$, as well. 

Example: Let $X_N$ denote a $k$-dimensional random vector such that $X_N \xrightarrow{p} \mu$ and 
$\sqrt{N}(X_N - \mu) \xrightarrow{d} Z$, where $\mu$ is a vector of constants and $Z$ is a random vector with 
some known distribution. Let now $\alpha(\cdot): \mathbb{R}^k \to \mathbb{R}^m$ denote a multivariate function with 
continuous first-order derivatives. It then follows from a simple first-order Taylor expansion 
that $\sqrt{N}\left(\alpha(X_N) - \alpha(\mu)\right) \xrightarrow{d} A Z$, where $A = \partial\alpha(\mu)/\partial\mu'$. To appreciate why, note that the 
first-order Taylor expansion is $\alpha(X_N) = \alpha(\mu) + \frac{\partial\alpha(\mu_N^*)}{\partial\mu'}(X_N - \mu)$ with $\mu_N^* = \lambda\mu + (1 - \lambda)X_N$ 
for some $\lambda$ in the unit interval. Subtracting $\alpha(\mu)$ and multiplying both sides by $\sqrt{N}$ then yields the result given 
that $\mu_N^* \xrightarrow{p} \mu$ for any $\lambda$ provided that $X_N \xrightarrow{p} \mu$. 

The above example illustrates the delta method, which is a useful tool to derive 
the asymptotic distribution of a known function of a statistic. It is interesting to note that, 
even though the example assumes that $X_N$ converges in probability to a vector of constants, 
it suffices to multiply the deviation $X_N - \mu$ by $\sqrt{N}$ to avoid a degenerate limit and obtain convergence 
in distribution. The next section discusses more thoroughly the conditions under which this 
may happen in the context of sample means. 
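
A small simulation can make the delta method tangible. The sketch below is an illustration under assumptions of our own, none of which come from the text: scalar data from an exponential distribution with unit mean and variance, the function g(x) = x^2, and arbitrary choices for the seed, sample size, and number of replications. It compares the simulated variance of sqrt(N)(g(sample mean) - g(mu)) with the delta-method value g'(mu)^2 times sigma^2.

```python
import math
import random
import statistics

def scaled_error(rng, n):
    """sqrt(n) * (g(sample mean) - g(mu)) for g(x) = x**2 and Exp(1) data, so mu = 1."""
    xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    return math.sqrt(n) * (xbar**2 - 1.0)

rng = random.Random(42)
draws = [scaled_error(rng, n=400) for _ in range(2_000)]

simulated_var = statistics.variance(draws)
delta_var = (2 * 1.0) ** 2 * 1.0   # g'(mu)**2 * sigma**2 with mu = sigma**2 = 1
print(simulated_var, delta_var)
```

The two figures should be close but not identical, since the delta method is a large-sample approximation and the simulation itself carries Monte Carlo noise.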

5.2.2 Limit theory for sample means 

We can write most statistics of interest as sample means. For instance, the sample variance 
is the sample mean of the squared deviations with respect to the mean value in the sample. 
This means that we can learn a lot about the asymptotic behavior of most statistics if 
we understand well what happens with a generic sample mean in large samples. In what 
follows, we will discuss two sorts of asymptotic results: Laws of large numbers (LLN) handle 
convergence in probability and almost sure convergence, whereas central limit theorems 
(CLT) deal with convergence in distribution for sample means. 

We start with Chebyshev's weak law of large numbers, which posits that $\bar{X}_N = \frac{1}{N}\sum_{i=1}^N X_i \xrightarrow{p} \mu$ 
if $\lim_{N\to\infty} E(\bar{X}_N) = \mu$ and $\lim_{N\to\infty} \mathrm{var}(\bar{X}_N) = 0$. It is easy to see why this result holds by 
noting that these moment conditions essentially ensure convergence in mean square, which 
in turn implies convergence in probability. Also, the above result does not require random 
sampling given that we are not necessarily imposing that $E(X_i)$ is constant for $i = 1, \ldots, N$. 
If we compute the sample mean of a random sample, it then suffices to assume that $E|X| < \infty$ 
to obtain $\bar{X}_N \xrightarrow{p} \mu = E(X)$. We will refer to this result as Khintchine's weak law of large 
numbers for random samples. By imposing a slightly more stringent condition, it is also 
possible to show that the sample mean almost surely converges to the true mean of a random 
sample (also known as Kolmogorov's strong law of large numbers). 

Finally, we tackle the asymptotic distribution of a sample mean by means of the Lindeberg-Lévy 
central limit theorem, which says that $\sqrt{N}(\bar{X}_N - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ as long as $X_i$ is iid 
with $E|X_i|^2 < \infty$. To sum up, a sample mean $\bar{X}_N$ of iid random variables with finite mean 
and variance is such that $\bar{X}_N \xrightarrow{p} \mu$ (due to LLN) and $\sqrt{N}(\bar{X}_N - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ (due to 
CLT). 
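
As a rough check of how quickly the normal approximation kicks in, the sketch below simulates standardized means of only 5 observations from a Uniform(0, 1) distribution and counts how often they fall within 1.96 of zero. The seed, the number of replications, and the variable names are our own assumptions.

```python
import math
import random

rng = random.Random(7)
n, reps = 5, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)   # mean and standard deviation of a Uniform(0, 1)

def standardized_mean():
    """sqrt(n) * (sample mean - mu) / sigma for one sample of n uniform draws."""
    xbar = sum(rng.random() for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

z = [standardized_mean() for _ in range(reps)]

# Under a standard normal, about 95% of the draws fall in [-1.96, 1.96].
coverage = sum(abs(v) <= 1.96 for v in z) / reps
print(coverage)
```

A share close to 95% is in line with the earlier remark that, for uniform data, even a handful of observations can be "large enough".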

There is a whole bunch of different laws of large numbers and central limit theorems that 
deal with all sorts of settings. Extensions include, among others, LLN and CLT for random 
variables that are dependent and/or non-identically distributed as well as for random vectors 
and matrices. Although there is not much gain in showing/memorizing all the different 
conditions under which we can derive either a LLN or a CLT, it is important to know that 
such results exist for a wide array of situations. 

Example: Let us revisit the last example of Section 5.1. Assuming that the number of 
transactions per week on the Macau stock exchange is iid over time, with some finite mean 
and variance, ensures that the normal approximation for the distribution of the sample 
mean will work well in practice. However, it is sort of risky to assume that the number of 
transactions per week is independent over time. A simple inspection of a time series plot 
for such a series would typically indicate that market activity clusters over time. Although 
this does not affect the application of Chebyshev's weak LLN, we must consider a CLT 
that allows for dependence over time. There are indeed several central limit theorems that 
relax the assumption of independence (which we will not review here given that they impose 
primitive conditions that restrict the dependence between observations in quite complicated 
ways). 

Chapter 6 

Point and interval estimation 



An estimator is a statistic that we employ to estimate an unknown population parameter 
such as, for example, a population mean. It is also a random variable in that its value 
depends on the particular realization of the sample. A parameter estimate then is the value 
that we observe for an estimator given the sample. The next example illustrates the fact 
that we can always think of many different estimators for a given quantity. 

Example: Let X denote a Gaussian random variable with mean 1/3 and unknown 

variance, that is, X ~ A/"(l/3, a 2 ). Suppose that we draw a random sample of 5 observations 
with values: X\ — 3, x% — 4, X3 — ^, x± — 3, and £5 = |. We could estimate the population 
mean by any of the following estimators: 

1. jui = Xi, which produces an estimate of |; 

2. ju 2 = Xl ~! 2 X4 , which entails an estimate of |; 

3. JU3 = X 2 , giving way to an estimate of |; and 

4. J14 = X$ = I J2t=i -%-i, leading to an estimate of ^. 

Needless to say, the list above is not exhaustive at all and many other estimators exist 
for the population mean. That's exactly why we must come up with some criteria to choose 
between estimators. Sections 6.1.1 to 6.1.3 discuss some intuitive criteria, whereas Sections 
6.1.4 and 6.1.5 describe the two most popular estimation methods in statistics. Regardless of 
the method we employ, the usual estimation procedure involves three steps. First, we must 
draw/observe a random sample from the population of interest. Second, we must calculate 
a point estimate of the parameter. By point estimation, we mean assigning a unique value 
to the estimate as opposed to an interval of possible values as in interval estimation. Third, 
we must compute a measure of variability for the estimator that accounts for the sampling 
variation in the data. This often takes the form of computing confidence intervals. 

In what follows, we first discuss point estimation and then turn attention to the more 
interesting problem of interval estimation. We say interval estimation is more interesting for 
it allows us to bridge the two main strands of statistical inference, namely, estimation and 
hypothesis testing. 

6.1 Point estimation 

We denote by $\hat{\theta}_N$ a point estimator of $\theta$ based on a sample of $N$ observations, though we will 
sometimes omit the dependence on the sample size to simplify notation. We next define in 
a more formal manner what we mean by point estimator. 

Consider a sample $X^{(N)} = (X_1, \ldots, X_N)$ of $N$ random variables that we denote by $X_i$ 
for $i = 1, \ldots, N$. We know that a statistic is a function of the sample $X^{(N)}$ and we define 
a point estimator as a statistic $\hat{\theta}_N = \hat{\theta}(X^{(N)}) = \hat{\theta}(X_1, \ldots, X_N)$ that we employ to infer the 
value of a parameter $\theta$ of the joint distribution of $X^{(N)} = (X_1, \ldots, X_N)$. The parameter 
estimate is the realization of the estimator, that is to say, $\hat{\theta}_N = \hat{\theta}(x_1, \ldots, x_N)$. Note the 
abuse of notation in that we refer to both estimator and estimate as $\hat{\theta}_N$. 

There are several possible estimators for any parameter and hence we will discuss in 
what follows the sort of properties we would like our estimator to have. In particular, we will 
start with the definition of the mean squared error of a given estimator so as to motivate 
the discussion about unbiasedness, consistency, and efficiency. 

6.1.1 Mean squared error 

As there are many candidate estimators for a given population parameter, it seems paramount 
to rank them through some measure of precision. The most popular measure is the mean 
squared error, which gauges the average distance of the estimator to the true parameter 
value by means of a quadratic distance. For a sample $X^{(N)}$, the error of the estimator 
$\hat{\theta}_N = \hat{\theta}(X^{(N)})$ is given by $\hat{\theta}_N - \theta$. Note that the estimation error depends not only on the 
estimator, but also on the particular sample we observe. Different samples will give distinct 
point estimates for the population parameter of interest. 



Definition: The mean squared error of $\hat{\theta}_N$ is $\mathrm{MSE}(\hat{\theta}_N, \theta) = E(\hat{\theta}_N - \theta)^2$. 



The intuition for a measure such as the MSE is straightforward. We square the estimation 
error before taking averages in order to avoid negative values canceling out with positive 
values. It thus measures how far, on average, the set of estimates are from the population 
parameter of interest. The nicest thing about the mean squared error is that we can decompose 
it into two readily interpretable components: 



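$$
\begin{aligned}
\mathrm{MSE}(\hat{\theta}_N, \theta) &= E(\hat{\theta}_N - \theta)^2 = E\left[\hat{\theta}_N - E(\hat{\theta}_N) + E(\hat{\theta}_N) - \theta\right]^2 \\
&= E\left[\hat{\theta}_N - E(\hat{\theta}_N)\right]^2 + \left[E(\hat{\theta}_N) - \theta\right]^2 + 2E\left\{\left[\hat{\theta}_N - E(\hat{\theta}_N)\right]\left[E(\hat{\theta}_N) - \theta\right]\right\} \\
&= \mathrm{var}(\hat{\theta}_N) + \left[E(\hat{\theta}_N) - \theta\right]^2 + 2\left[E(\hat{\theta}_N) - \theta\right]E\left[\hat{\theta}_N - E(\hat{\theta}_N)\right] \\
&= \mathrm{var}(\hat{\theta}_N) + \left[E(\hat{\theta}_N) - \theta\right]^2.
\end{aligned}
$$
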

This decomposition clarifies that the mean squared error is the sum of the variance of the 
estimator and the squared bias of the estimator. The variance indicates how far, on average, 
the set of estimates are from their expected value and hence it is a measure of precision of 
the estimator. We call the square root of the variance of an estimator its standard error 
(rather than standard deviation). In contrast, we define the bias of $\hat{\theta}_N$ as $E(\hat{\theta}_N) - \theta$, that 
is to say, the difference between the average estimate and $\theta$ or, equivalently, the average 
estimation error. The bias thus measures how accurate the estimator is. We will later show 
that there is a bias-variance tradeoff, which translates into a trade-off between accuracy and 
precision. 

We say that $\hat{\theta}_N$ is an unbiased estimator of $\theta$ if and only if $E(\hat{\theta}_N) = \theta$. The bias is a 
property of the estimator, not of the estimate. Ideally, we would like to have the most 
accurate and precise estimator. The next definition formalizes this idea by setting up the 
MSE criterion for choosing between estimators. 

Definition: Let $\Theta$ denote the parameter space that collects all possible values for the 
population parameter $\theta$. Consider two estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ for $\theta$. We say that $\hat{\theta}_1$ is more 
efficient than $\hat{\theta}_2$ if $\mathrm{MSE}(\hat{\theta}_1, \theta) \le \mathrm{MSE}(\hat{\theta}_2, \theta)$ for every $\theta \in \Theta$ and $\mathrm{MSE}(\hat{\theta}_1, \theta) < \mathrm{MSE}(\hat{\theta}_2, \theta)$ 
for at least one value of $\theta \in \Theta$. 

Example: Let $X^{(N)} = (X_1, \ldots, X_N)$ denote a random sample of iid $\mathcal{N}(\mu, \sigma^2)$, whose 
parameters we estimate by means of $\hat{\mu}_N = \frac{1}{N}\sum_{i=1}^N X_i$ and $\hat{\sigma}^2_N = \frac{1}{N-1}\sum_{i=1}^N (X_i - \hat{\mu}_N)^2$, 
respectively. Both estimators are unbiased in that $E(\hat{\mu}_N) = \frac{1}{N}\sum_{i=1}^N E(X_i) = \mu$ and that 

$$
\begin{aligned}
E(\hat{\sigma}^2_N) &= E\left[\frac{1}{N-1}\sum_{i=1}^N (X_i - \hat{\mu}_N)^2\right] = E\left[\frac{1}{N-1}\sum_{i=1}^N \left(X_i^2 - 2X_i\hat{\mu}_N + \hat{\mu}_N^2\right)\right] \\
&= \frac{1}{N-1}\,E\left(\sum_{i=1}^N X_i^2 - 2\hat{\mu}_N (N\hat{\mu}_N) + N\hat{\mu}_N^2\right) \\
&= \frac{1}{N-1}\left[N E(X_i^2) - N E(\hat{\mu}_N^2)\right] = \frac{N}{N-1}\left[E(X_i^2) - E(\hat{\mu}_N^2)\right] \\
&= \frac{N}{N-1}\left[(\mu^2 + \sigma^2) - \left(\mu^2 + \frac{\sigma^2}{N}\right)\right] = \frac{N}{N-1}\,\frac{N-1}{N}\,\sigma^2 = \sigma^2,
\end{aligned}
$$

as the second uncentered moment of any random variable is the sum of the square of the 
first moment and the variance, i.e., $E(X^2) = [E(X)]^2 + \mathrm{var}(X)$. See below for the derivation 
of the variance of $\hat{\mu}_N$. Unbiasedness means that the mean squared errors of $\hat{\mu}_N$ and $\hat{\sigma}^2_N$ are 
just their variances. The variance of $\hat{\mu}_N$ is 



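$$
\begin{aligned}
E(\hat{\mu}_N - \mu)^2 &= E\left(\frac{1}{N}\sum_{i=1}^N X_i - \mu\right)^2 = E\left[\frac{1}{N}\sum_{i=1}^N (X_i - \mu)\right]^2 \\
&= E\left[\frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N (X_i - \mu)(X_j - \mu)\right] = \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N E\left[(X_i - \mu)(X_j - \mu)\right] \\
&= \frac{1}{N^2}\left[\sum_{i=1}^N \mathrm{var}(X_i) + 2\sum_{1 \le i < j \le N} \mathrm{cov}(X_i, X_j)\right] = \frac{\sigma^2}{N},
\end{aligned}
$$

given that independence ensures that $\mathrm{cov}(X_i, X_j) = 0$ for all $1 \le i \ne j \le N$. It is also 
possible to show that, under normality, the variance of $\hat{\sigma}^2_N$ is given by 

$$E\left(\hat{\sigma}^2_N - \sigma^2\right)^2 = \frac{2}{N-1}\,\sigma^4.$$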



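We require normality for it ensures that the fourth moment depends only on $\sigma^2$. Consider 
now the following alternative variance estimator: $\tilde{\sigma}^2_N = \frac{1}{N}\sum_{i=1}^N (X_i - \hat{\mu}_N)^2$. This estimator 
is obviously biased in that $E(\tilde{\sigma}^2_N) = E\left(\frac{N-1}{N}\hat{\sigma}^2_N\right) = \frac{N-1}{N}\sigma^2$. As for the variance, we employ 
the same trick to show that 

$$\mathrm{var}(\tilde{\sigma}^2_N) = \mathrm{var}\left(\frac{N-1}{N}\hat{\sigma}^2_N\right) = \left(\frac{N-1}{N}\right)^2 \mathrm{var}(\hat{\sigma}^2_N) = \left(\frac{N-1}{N}\right)^2 \frac{2}{N-1}\,\sigma^4 = \frac{2(N-1)}{N^2}\,\sigma^4.$$

This means that the mean squared error of $\tilde{\sigma}^2_N$ is 

$$\mathrm{MSE}(\tilde{\sigma}^2_N, \sigma^2) = \frac{2(N-1)}{N^2}\,\sigma^4 + \left(\frac{N-1}{N}\sigma^2 - \sigma^2\right)^2 = \left[\frac{2(N-1)}{N^2} + \frac{1}{N^2}\right]\sigma^4 = \frac{2(N-1) + 1}{N^2}\,\sigma^4 = \frac{2N-1}{N^2}\,\sigma^4,$$

which is strictly smaller than the mean squared error of $\hat{\sigma}^2_N$, namely $\frac{2}{N-1}\sigma^4$. In the MSE sense, it then follows 
that $\tilde{\sigma}^2_N$ is superior to $\hat{\sigma}^2_N$ as an estimator of $\sigma^2$. 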
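
The ranking of the two variance estimators can be checked numerically. The sketch below evaluates the closed-form mean squared errors, 2σ⁴/(N−1) for the divisor N−1 and (2N−1)σ⁴/N² for the divisor N, and compares them with a Monte Carlo estimate. The function names, the choice N = 5 with σ² = 1, the seed, and the number of replications are our own assumptions for illustration.

```python
import random

def mse_formulas(n, sigma2=1.0):
    """Closed-form MSEs under normality: divisor N-1 (unbiased) versus divisor N (biased)."""
    unbiased = 2 * sigma2**2 / (n - 1)
    biased = (2 * n - 1) * sigma2**2 / n**2
    return unbiased, biased

def simulate_mse(n, reps, seed=0):
    """Monte Carlo MSEs of both estimators for standard normal data (true variance 1)."""
    rng = random.Random(seed)
    se_unbiased = se_biased = 0.0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = sum(x) / n
        ss = sum((xi - xbar) ** 2 for xi in x)   # sum of squared deviations
        se_unbiased += (ss / (n - 1) - 1.0) ** 2
        se_biased += (ss / n - 1.0) ** 2
    return se_unbiased / reps, se_biased / reps

print(mse_formulas(5))
print(simulate_mse(5, reps=40_000))
```

In small samples the gap is sizable, whereas both formulas shrink toward zero at the same rate as N grows.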

6.1.2 Unbiasedness 

Although it seems very reasonable to compare estimators purely on the basis of the mean 
squared error, it is very often the case that there exists no best estimator. The reason is that 
the class of all possible estimators is too large. One way to make the selection of estimators 
tractable is to restrict the search to a specific class of estimators. A natural choice is the 
collection of all unbiased estimators. 

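Definition: An estimator $\hat{\theta}^*_N$ is a best unbiased estimator of $\theta$ if $E(\hat{\theta}^*_N) = \theta$ for all $\theta \in \Theta$ 
and $\mathrm{var}(\hat{\theta}^*_N) \le \mathrm{var}(\hat{\theta}_N)$ for any other unbiased estimator $\hat{\theta}_N$ such that $E(\hat{\theta}_N) = \theta$. 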

The above definition does not help us much in the sense that it does not provide much 
information about the best unbiased estimator. The next result adds to this discussion by 
establishing a lower bound for the variance of any estimator. It is known as the Cramer-Rao 
inequality. 

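Theorem: Let $X^{(N)} = (X_1, \ldots, X_N)$ denote a random vector with joint probability 
density function $f(X^{(N)}; \theta)$. Consider an estimator $\hat{\theta}_N$ of $\theta$ such that 

$$\frac{d}{d\theta}E(\hat{\theta}_N) = \int \frac{\partial}{\partial\theta}\left[\hat{\theta}(x)\, f(x; \theta)\right] dx$$

and $\mathrm{var}(\hat{\theta}_N) < \infty$. It then follows that 

$$\mathrm{var}(\hat{\theta}_N) \ge \frac{\left[\frac{d}{d\theta}E(\hat{\theta}_N)\right]^2}{E\left[\left(\frac{\partial}{\partial\theta}\ln f(X^{(N)}; \theta)\right)^2\right]}.$$
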

It is easy to see that the numerator of the right-hand side of the above inequality is equal 
to one if the estimator is unbiased, whereas the denominator depends on the expected value 
of the square of the first derivative of the logarithm of the joint density function. We call this 
first derivative the score function, which will play a key role in Section 6.1.5. 
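
As a quick worked illustration of the bound, consider an iid sample from a normal distribution with mean $\mu$ and variance $\sigma^2$, where we treat $\sigma^2$ as known (an assumption made here only to keep the parameter scalar). For an unbiased estimator of $\mu$ the numerator equals one, and the denominator follows from the score:

```latex
\begin{aligned}
\ln f(X^{(N)}; \mu) &= -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N (X_i - \mu)^2, \\
\frac{\partial}{\partial\mu}\ln f(X^{(N)}; \mu) &= \frac{1}{\sigma^2}\sum_{i=1}^N (X_i - \mu), \\
E\left[\left(\frac{\partial}{\partial\mu}\ln f(X^{(N)}; \mu)\right)^2\right] &= \frac{1}{\sigma^4}\, N\sigma^2 = \frac{N}{\sigma^2}, \\
\mathrm{var}(\hat{\mu}_N) &\ge \frac{1}{N/\sigma^2} = \frac{\sigma^2}{N}.
\end{aligned}
```

The sample mean has variance exactly $\sigma^2/N$, so in this particular setting it attains the lower bound.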

6.1.3 Consistency 

Although it makes sense to impose unbiasedness, there are biased estimators that achieve 
better mean squared errors as we have seen in the example of Section 6.1.1. What matters 
most is to nail the parameter down as the sample size increases. In that example, 
for instance, $\tilde{\sigma}^2_N$ is asymptotically unbiased in that the bias shrinks to zero as the sample 
size $N$ goes to infinity. As the Nobel prize winner Clive Granger once said, "If you cannot 
get it right as the sample size grows to infinity, you shouldn't be in the business". The next 
definition formalizes this notion by introducing the concept of consistent estimators. 

Definition: Let $\Theta$ denote the parameter space and $\{\hat{\theta}_N; N \ge 1\}$ denote a sequence of 
estimators of the population parameter $\theta \in \Theta$ indexed by the sample size $N$. In particular, 
let $\hat{\theta}_N$ denote an estimator based on the first $N$ observations of a sample $(X_1, X_2, \ldots)$ from 
a given probability distribution $f(x; \theta)$. The sequence $\hat{\theta}_N$ is (weakly) consistent on $\Theta$ if and 
only if, for all $\theta \in \Theta$ and for all $\epsilon > 0$, it holds that $\lim_{N\to\infty} \Pr(|\hat{\theta}_N - \theta| > \epsilon) = 0$. 

So, a consistent estimator converges in probability to the true value of the population 
parameter. There is no problem in using a different notion of convergence. If we are talking 
about convergence in mean square then we say the estimator is consistent in mean square. 
Similarly, a strongly consistent estimator converges almost surely to the true value of the 
parameter. 



Example: Suppose that $X_1, X_2, \ldots$ is a sequence of random variables drawn from a 
$\mathcal{N}(\mu, \sigma^2)$ distribution. To estimate $\mu$ based on the first $N$ observations, we usually use the 
sample mean $\hat{\theta}_N = (X_1 + \ldots + X_N)/N$, which is an unbiased estimator with variance of 
$\sigma^2/N$. Given that linear combinations of normal variates are also normal, $\hat{\theta}_N$ is normal with 
mean $\mu$ and variance $\sigma^2/N$ and hence $\sqrt{N}(\hat{\theta}_N - \mu)/\sigma$ has a standard normal distribution. 
It then follows that 

$$\Pr(\hat{\theta}_N - \mu > \epsilon) = \Pr\left(\frac{\sqrt{N}(\hat{\theta}_N - \mu)}{\sigma} > \frac{\sqrt{N}\,\epsilon}{\sigma}\right) = 1 - \Phi\left(\frac{\sqrt{N}\,\epsilon}{\sigma}\right) \to 0$$

as $N$ tends to infinity, for any fixed $\epsilon > 0$. Similarly, $\Pr(\hat{\theta}_N - \mu < -\epsilon) \to 0$. Therefore, the 
sequence $\hat{\theta}_N$ of sample means is consistent for the population mean $\mu$. 
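
The two tail probabilities add up to $2[1 - \Phi(\sqrt{N}\epsilon/\sigma)]$, which we can evaluate directly. The sketch below does so for a few sample sizes; the function name and the values ε = 0.1 and σ = 1 are assumptions chosen purely for illustration.

```python
import math

def tail_probability(n, eps, sigma):
    """Pr(|sample mean - mu| > eps) under normality: 2 * (1 - Phi(sqrt(n) * eps / sigma))."""
    z = math.sqrt(n) * eps / sigma
    return math.erfc(z / math.sqrt(2))   # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z))

probs = [tail_probability(n, eps=0.1, sigma=1.0) for n in (10, 100, 1_000, 10_000)]
print(probs)
```

The probabilities fall monotonically toward zero as N grows, which is precisely what weak consistency requires.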
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In the above example, it is easy to prove consistency due to the normality assumption. 
In general problem, it is not easy to find the exact distribution of On and hence establishing 
consistency in a direct manner becomes intractable. That's why most proofs of consistency 
are based on the Chebychev inequality. Applying the latter to an estimator 6 N yields 

Pr(\e N -9\>e)<^^. 

e z 

As a consequence, it suffices to show that the mean squared error of On converges to zero (or, 
equivalently, that On is asymptotically unbiased and its variance shrinks to zero). Another 
useful result to prove consistency is the continuous mapping theorem, which dictates that, 
if On is consistent for 0, then q(0n) is consistent for g(0) if g(-) is a continuous real-valued 
function. 
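As a rough numerical illustration of the normal example and of the Chebyshev bound, the following Python sketch compares the exact tail probability $\Pr(|\hat\theta_N - \mu| > \epsilon) = 2[1 - \Phi(\sqrt{N}\epsilon/\sigma)]$ of the sample mean with the bound $\sigma^2/(N\epsilon^2)$. The function names and the values $\sigma = 2$ and $\epsilon = 0.5$ are assumptions chosen only for illustration; they do not come from the text.

```python
from statistics import NormalDist


def exact_tail_probability(n, sigma, eps):
    """Pr(|mean - mu| > eps) for the mean of n iid N(mu, sigma^2) draws."""
    z = (n ** 0.5) * eps / sigma
    return 2.0 * (1.0 - NormalDist().cdf(z))


def chebyshev_bound(n, sigma, eps):
    """Chebyshev bound MSE / eps^2 (capped at one), with MSE = sigma^2 / n."""
    return min(1.0, sigma ** 2 / (n * eps ** 2))


if __name__ == "__main__":
    sigma, eps = 2.0, 0.5  # illustrative values, not taken from the text
    for n in (10, 100, 1000):
        # Both quantities shrink with n; the bound is valid but far from tight.
        print(n, exact_tail_probability(n, sigma, eps), chebyshev_bound(n, sigma, eps))
```

Both columns shrink to zero as the sample size grows, which is all that consistency requires; the Chebyshev bound is much looser than the exact probability, but it does not rely on normality.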

In the next sections, we discuss two methods for deriving consistent estimators of a 
population parameter. The first is the method of moments, which imposes assumptions only
on the moments of the distribution. This means that it does not require us to know the 
joint distribution of the data, only the moments. In contrast, the second method requires 
the specification of the joint distribution and, as such, it takes advantage of the whole 
probabilistic structure to derive efficient estimators for the population parameters. 

6.1.4 Method of moments 

The method of moments relies on the very simple idea of matching sample moments with
their population counterparts. Let $(X_1, \ldots, X_N)$ denote a random sample with marginal
probability density functions given by $f(X_i; \theta)$ with $\theta = (\theta_1, \ldots, \theta_k)'$. Assume that the first
$k$ moments of $X_i$ exist, that is to say, $\mathbb{E}(X_i^j) = \mu_j(\theta) < \infty$ for $j = 1, \ldots, k$. The moments naturally depend
on the parameter vector $\theta = (\theta_1, \ldots, \theta_k)'$, and so we can equate the first $k$ sample moments
to the corresponding $k$ population moments to solve for $(\theta_1, \ldots, \theta_k)$:

$$\frac{1}{N}\sum_{i=1}^N X_i = \mu_1(\theta), \quad \frac{1}{N}\sum_{i=1}^N X_i^2 = \mu_2(\theta), \quad \ldots, \quad \frac{1}{N}\sum_{i=1}^N X_i^k = \mu_k(\theta).$$

The above forms a system of $k$ equations with $k$ unknowns, and hence it suffices to solve for
$(\theta_1, \ldots, \theta_k)$ as a function of the sample moments.

Examples 

(1) Consider a random sample $(X_1, \ldots, X_N)$ from a normal distribution with mean $\mu$ and
variance $\sigma^2$. Using the method of moments, we equate the first two sample and population
moments, giving way to $\hat\mu_N = \frac{1}{N}\sum_{i=1}^N X_i$ and $\frac{1}{N}\sum_{i=1}^N X_i^2 = \hat\mu_N^2 + \hat\sigma_N^2$. The latter leads to

$$\hat\sigma_N^2 = \frac{1}{N}\sum_{i=1}^N X_i^2 - \hat\mu_N^2 = \frac{1}{N}\sum_{i=1}^N (X_i - \hat\mu_N)^2.$$



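(2) Suppose that $(X_1, \ldots, X_N)$ is a random sample drawn from an exponential distribution
with parameter $\lambda$. The method of moments would suggest employing the sample mean as
an estimator of $\lambda$ when $\lambda$ denotes the mean of the distribution. If $\lambda$ instead denotes the rate,
so that the mean is $1/\lambda$, the estimator is the reciprocal of the sample mean.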

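(3) Suppose that $(X_1, \ldots, X_N)$ is a random sample from a negative binomial distribution
with parameters $k$ and $p$, whose mean is $k/p$ and whose variance is $k(1-p)/p^2$. The method of
moments lands us with $\frac{1}{N}\sum_{i=1}^N X_i = k/p$ and $\frac{1}{N}\sum_{i=1}^N X_i^2 = k(1-p)/p^2 + k^2/p^2$.
Writing $\bar X_N$ for the sample mean and $\hat\sigma_N^2 = \frac{1}{N}\sum_{i=1}^N X_i^2 - \bar X_N^2$, solving for $k$ and $p$ then yields
$\hat p_N = \bar X_N/(\bar X_N + \hat\sigma_N^2)$ and $\hat k_N = \bar X_N^2/(\bar X_N + \hat\sigma_N^2)$.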
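The moment-matching step in examples (1) and (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the negative binomial is taken to have mean $k/p$ and variance $k(1-p)/p^2$, so that matching the first two moments gives $p = m_1/(m_1 + v)$ and $k = m_1 p$, where $m_1$ is the sample mean and $v$ the second sample moment minus $m_1^2$.

```python
def sample_moments(xs):
    """Return the first two raw sample moments of xs."""
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    return m1, m2


def mom_normal(xs):
    """Method-of-moments estimates (mu, sigma^2) for a normal sample."""
    m1, m2 = sample_moments(xs)
    return m1, m2 - m1 ** 2


def mom_negative_binomial(xs):
    """Method-of-moments estimates (k, p), assuming mean k/p and variance k(1-p)/p^2."""
    m1, m2 = sample_moments(xs)
    variance = m2 - m1 ** 2
    p = m1 / (m1 + variance)  # from variance / mean = (1 - p) / p
    k = m1 * p                # from mean = k / p
    return k, p
```

Note that the estimate of $k$ need not be an integer; the method only matches moments and does not enforce restrictions on the parameter space.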

6.1.5 Maximum likelihood 

The method of moments uses only a subset of the joint distribution moments and hence 
it cannot be as efficient as an estimator that exploits the whole information given by the 
joint distribution. The price to pay for the latter is that it is easier to misspecify the joint 
distribution than a couple of moments. For instance, economic theory speaks only about 
expectations (and hence moments), without much to say about distributions (unless as a 
simplifying assumption to make the model tractable). 

Consider a random sample $(X_1, \ldots, X_N)$ drawn from a probability density function given
by $f(X_i; \theta)$, where $\theta = (\theta_1, \ldots, \theta_k)'$. We define the likelihood function as

$$\mathcal{L}(\theta; X) = \prod_{i=1}^N f(X_i; \theta).$$

In other words, the likelihood function is equivalent to the joint density of the data but for
the argument. While the joint density is a function of the random variables $(X_1, \ldots, X_N)$
given a parameter vector $\theta$, the likelihood inverts the problem and considers a function of
the parameter vector given the sample we observe $(X_1, \ldots, X_N)$.

The maximum likelihood (ML) estimator $\hat\theta_N$ then searches for the parameter value that
maximizes the probability of observing the sample $(X_1, \ldots, X_N)$, resulting in

$$\hat\theta_N = \arg\max_{\theta \in \Theta} \mathcal{L}(\theta; X) = \arg\max_{\theta \in \Theta} \prod_{i=1}^N f(X_i; \theta).$$

In view that monotone transformations do not alter the maximization problem, we prefer to
take the logarithm of the likelihood function so as to end up with a sum of log-densities, for
it is always easier to manipulate sums rather than products. This yields

$$\hat\theta_N = \arg\max_{\theta \in \Theta} \ln\mathcal{L}(\theta; X) = \arg\max_{\theta \in \Theta} \sum_{i=1}^N \ln f(X_i; \theta).$$

So, to find the ML estimator, we must equate the score vector $\frac{\partial}{\partial\theta} \ln\mathcal{L}(\theta; X)$ to zero.

The maximum likelihood method entails a number of interesting properties for the estimator.
First, it is invariant to parameter transformations in that, if $\hat\theta$ is the ML estimator of
$\theta$, then the ML estimator of $g(\theta)$ is $g(\hat\theta)$ for any function $g$ of $\theta$. Second, the ML estimator
is very easy to deal with in large samples for it is asymptotically normal. Third, under certain
weak regularity conditions, the ML estimator is asymptotically unbiased and efficient
in that it achieves the lower bound given by the Cramer-Rao inequality as the sample size
goes to infinity.

Examples 

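(1) Consider a random sample $(X_1, \ldots, X_N)$ from a binomial model with probability $p$, with
each $X_i$ recording the outcome (one or zero) of a single trial. The likelihood is
$\mathcal{L}(p; X) = \prod_{i=1}^N p^{X_i}(1-p)^{1-X_i}$ and hence the score function is

$$\frac{\partial}{\partial p}\ln\mathcal{L}(p; X) = \frac{1}{p}\sum_{i=1}^N X_i - \frac{1}{1-p}\sum_{i=1}^N (1 - X_i).$$

Equating the score function to zero yields $\hat p_N = \frac{1}{N}\sum_{i=1}^N X_i$.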

(2) Suppose that $(X_1, \ldots, X_N)$ is a random sample drawn from a Poisson distribution with
arrival rate $\lambda$. The first-order condition for the maximization of the log-likelihood function
then is

$$\frac{\partial}{\partial\lambda}\ln\mathcal{L}(\lambda; X)\Big|_{\lambda=\hat\lambda_N} = \frac{\partial}{\partial\lambda}\sum_{i=1}^N \ln\left(\frac{e^{-\lambda}\lambda^{X_i}}{X_i!}\right)\Big|_{\lambda=\hat\lambda_N} = \frac{\partial}{\partial\lambda}\left[-N\lambda + \ln\lambda\sum_{i=1}^N X_i - \sum_{i=1}^N \ln(X_i!)\right]\Big|_{\lambda=\hat\lambda_N} = \frac{\sum_{i=1}^N X_i}{\hat\lambda_N} - N = 0,$$

giving way to $\hat\lambda_N = \frac{1}{N}\sum_{i=1}^N X_i$ as the maximum likelihood estimator.

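(3) Suppose that $(X_1, \ldots, X_N)$ is a random sample from a normal distribution with mean
$\mu$ and variance $\sigma^2$. The first-order conditions set
$\frac{\partial}{\partial\mu}\ln\mathcal{L}(\mu, \sigma^2; X) = \frac{1}{\sigma^2}\sum_{i=1}^N (X_i - \mu)$ and
$\frac{\partial}{\partial\sigma^2}\ln\mathcal{L}(\mu, \sigma^2; X) = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N (X_i - \mu)^2$
equal to zero, yielding the following maximum likelihood estimators for the mean and variance:
$\hat\mu_N = \frac{1}{N}\sum_{i=1}^N X_i$ and $\hat\sigma_N^2 = \frac{1}{N}\sum_{i=1}^N (X_i - \hat\mu_N)^2$,
respectively.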
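When the first-order condition has no closed-form solution, the log-likelihood is maximized numerically. The sketch below does so for the Poisson case of example (2), where the answer is known to be the sample mean, so that the numerical result can be checked. The ternary search and the search interval are illustrative choices of ours, not the author's method; they rely on the Poisson log-likelihood being concave in the arrival rate.

```python
import math


def poisson_log_likelihood(lam, xs):
    """Log-likelihood of an iid Poisson sample xs at arrival rate lam > 0."""
    return sum(-lam + x * math.log(lam) - math.lgamma(x + 1) for x in xs)


def maximize_concave(f, lo, hi, iterations=200):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    for _ in range(iterations):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if f(a) < f(b):
            lo = a  # the maximizer cannot lie to the left of a
        else:
            hi = b  # the maximizer cannot lie to the right of b
    return (lo + hi) / 2.0


def poisson_mle(xs):
    """Numerical ML estimate of the Poisson rate (search interval is a heuristic)."""
    return maximize_concave(lambda lam: poisson_log_likelihood(lam, xs), 1e-9, max(xs) + 1.0)
```

For any sample of counts, `poisson_mle(xs)` should agree with `sum(xs) / len(xs)` up to numerical tolerance, in line with the first-order condition derived above.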





To show that the maximum likelihood estimator is consistent and efficient, we must
impose some regularity conditions. First, maximization of the likelihood function is over a
compact parameter space $\Theta \subset \mathbb{R}^k$ and the true parameter vector is in the interior of the
parameter space. Second, the average log-likelihood function $s_N(\theta) = \frac{1}{N}\sum_{i=1}^N \ln\mathcal{L}(\theta; X_i)$
converges almost surely to its expected value for all possible parameter values, that is to say,
$s_N(\theta) \xrightarrow{a.s.} \mathbb{E}_{\theta_0}[s_N(\theta)] = s_\infty(\theta, \theta_0)$ for every $\theta \in \Theta$, where the expectation is taken over the
joint distribution function evaluated at the true parameter value $\theta_0 \in \Theta$. Third, $s_N(\theta)$ is
continuous in $\theta \in \Theta$, and hence the same holds for $\mathbb{E}_{\theta_0}[s_N(\theta)]$. Fourth, the latter has a
unique maximum in $\theta \in \Theta$.

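We are now ready to show that $\hat\theta_N$ converges almost surely to the true value $\theta_0$ of
the population parameter vector. We start by noting that $\hat\theta_N$ for sure exists given that a
continuous function always has a maximum in a compact set. Second, for any $\theta \ne \theta_0$, it
follows that $\mathbb{E}_{\theta_0}[\ln\mathcal{L}(\theta; X) - \ln\mathcal{L}(\theta_0; X)] \le \ln\mathbb{E}_{\theta_0}[\mathcal{L}(\theta; X)/\mathcal{L}(\theta_0; X)]$ due to Jensen's
inequality as the logarithmic function is concave. However,

$$\mathbb{E}_{\theta_0}[\mathcal{L}(\theta; X)/\mathcal{L}(\theta_0; X)] = \int_{-\infty}^{\infty} \frac{\mathcal{L}(\theta; x)}{\mathcal{L}(\theta_0; x)}\,\mathcal{L}(\theta_0; x)\,dx = 1$$

and hence $\mathbb{E}_{\theta_0}[\ln\mathcal{L}(\theta; X) - \ln\mathcal{L}(\theta_0; X)] \le 0$. We now first divide both sides by $N$ to yield
$\mathbb{E}_{\theta_0}[s_N(\theta) - s_N(\theta_0)] \le 0$ and then take limits to obtain $s_\infty(\theta, \theta_0) \le s_\infty(\theta_0, \theta_0)$ almost surely
by the uniform convergence assumption. Further, the identification assumption ensures that
the inequality is strict if $\theta \ne \theta_0$, whereas $s_\infty(\hat\theta_N, \theta_0) \ge s_\infty(\theta_0, \theta_0)$ by construction given that
$\hat\theta_N$ maximizes the average log-likelihood. Altogether, this means that $\hat\theta_N \xrightarrow{a.s.} \theta_0$, proving
strong consistency.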

Establishing the weak consistency of the maximum likelihood estimator is much easier in that it
suffices to show that the mean of the ML estimator converges to the true value of the
parameter, while its variance shrinks to zero. The regularity conditions we must impose to
ensure weak consistency (i.e., convergence in probability) are indeed much milder than what
we assume to achieve strong consistency (i.e., almost sure convergence). In what follows, we
derive the asymptotic mean and variance of the ML estimator and then derive its asymptotic
normality under the assumption that $s_N(\theta)$ is twice continuously differentiable.

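We first take the derivative of $s_N(\theta)$ with respect to $\theta$ to find the score function, which
we then equate to zero to find the maximum of the log-likelihood function, namely,

$$\frac{\partial}{\partial\theta} s_N(\theta) = \frac{1}{N}\sum_{i=1}^N \frac{\partial}{\partial\theta}\ln\mathcal{L}(\theta; X_i) = 0.$$

Although we denote the score function by $\frac{\partial}{\partial\theta} s_N(\theta)$, note that it depends on $X$ and hence
it is also a random vector with the same dimension as the vector of parameters. The first
step of the proof is to show that the score function is on average zero for any $\theta \in \Theta$. This
is indeed the case as long as $\frac{\partial}{\partial\theta}\mathcal{L}(\theta; X_i)$ is bounded, and so we can switch the order of
differentiation and integration:

$$\mathbb{E}_\theta\left[\frac{\partial}{\partial\theta}\ln\mathcal{L}(\theta; X_i)\right] = \int_{-\infty}^{\infty} \frac{\partial}{\partial\theta}\ln\mathcal{L}(\theta; x_i)\,\mathcal{L}(\theta; x_i)\,dx_i = \int_{-\infty}^{\infty} \frac{\frac{\partial}{\partial\theta}\mathcal{L}(\theta; x_i)}{\mathcal{L}(\theta; x_i)}\,\mathcal{L}(\theta; x_i)\,dx_i$$

$$= \int_{-\infty}^{\infty} \frac{\partial}{\partial\theta}\mathcal{L}(\theta; x_i)\,dx_i = \frac{\partial}{\partial\theta}\int_{-\infty}^{\infty} \mathcal{L}(\theta; x_i)\,dx_i = \frac{\partial}{\partial\theta}\,1 = 0.$$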



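We next apply a Taylor expansion to the score function evaluated at the ML estimator,
which is equal to zero given the first-order condition:

$$0 = \frac{\partial}{\partial\theta} s_N(\hat\theta_N) = \frac{\partial}{\partial\theta} s_N(\theta_0) + \frac{\partial^2}{\partial\theta\,\partial\theta'} s_N(\theta^*)\,(\hat\theta_N - \theta_0),$$

where $\theta^* \in [\theta_0, \hat\theta_N]$. It is straightforward to show that $H_{\theta^*} = \frac{\partial^2}{\partial\theta\,\partial\theta'} s_N(\theta^*)$ is invertible
and thus $\sqrt{N}(\hat\theta_N - \theta_0) = -H_{\theta^*}^{-1}\sqrt{N}\,\frac{\partial}{\partial\theta} s_N(\theta_0)$. Note that
$H_{\theta^*} = \frac{1}{N}\sum_{i=1}^N \frac{\partial^2}{\partial\theta\,\partial\theta'}\ln\mathcal{L}(\theta^*; X_i)$
and hence it does satisfy a strong law of large numbers provided that the variance of
$\frac{\partial^2}{\partial\theta\,\partial\theta'}\ln\mathcal{L}(\theta^*; X_i)$ is finite. In addition, we know that the ML estimator is strongly consistent in
that $\hat\theta_N$ converges almost surely to $\theta_0$, implying that $\theta^*$ converges as well to $\theta_0$. Altogether,
this means that the random matrix $H_{\theta^*}$ converges to
$H_\infty(\theta_0) = \mathbb{E}_{\theta_0}\left[\frac{1}{N}\sum_{i=1}^N \frac{\partial^2}{\partial\theta\,\partial\theta'}\ln\mathcal{L}(\theta_0; X_i)\right] < \infty$
almost surely.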

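It now remains to study the asymptotic behavior of $\sqrt{N}\,\frac{\partial}{\partial\theta} s_N(\theta_0)$. In view that we
have already seen that the latter has mean zero for any $\theta \in \Theta$, it is reasonable to assume there
is a central limit theorem that applies. Letting $I_\infty(\theta_0) = \lim_{N\to\infty} \mathrm{var}\left(\sqrt{N}\,\frac{\partial}{\partial\theta} s_N(\theta_0)\right)$
then yields $I_\infty^{-1/2}(\theta_0)\,\sqrt{N}\,\frac{\partial}{\partial\theta} s_N(\theta_0) \xrightarrow{d} \mathcal{N}(0, I_k)$, where $I_k$ is a $k$-dimensional identity
matrix and $I_\infty(\theta_0)$ is known as the information matrix.¹ Combining all of the above ingredients
gives way to

$$\sqrt{N}(\hat\theta_N - \theta_0) \xrightarrow{d} \mathcal{N}\left(0,\ H_\infty^{-1}(\theta_0)\,I_\infty(\theta_0)\,H_\infty^{-1}(\theta_0)\right).$$

To demonstrate that the variance of the ML estimator achieves the Cramer-Rao lower bound
and hence it is efficient, it suffices to show that $I_\infty(\theta_0) = -H_\infty(\theta_0)$. To do so, we first
differentiate both sides of $\int_{-\infty}^{\infty} \frac{\partial}{\partial\theta}\ln\mathcal{L}(\theta; x_i)\,\mathcal{L}(\theta; x_i)\,dx_i = 0$ with respect to $\theta$, yielding

$$\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta\,\partial\theta'}\ln\mathcal{L}(\theta; X_i)\right] + \mathbb{E}_\theta\left[\frac{\partial}{\partial\theta}\ln\mathcal{L}(\theta; X_i)\,\frac{\partial}{\partial\theta'}\ln\mathcal{L}(\theta; X_i)\right] = 0,$$

so that the variance of each score equals minus the expected Hessian. It then follows that

$$\mathrm{var}\left(\sqrt{N}\,\frac{\partial}{\partial\theta} s_N(\theta)\right) = \frac{1}{N}\sum_{i=1}^N \mathrm{var}\left(\frac{\partial}{\partial\theta}\ln\mathcal{L}(\theta; X_i)\right) = -\frac{1}{N}\sum_{i=1}^N \mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta\,\partial\theta'}\ln\mathcal{L}(\theta; X_i)\right]$$

given that the scores $\frac{\partial}{\partial\theta}\ln\mathcal{L}(\theta; X_i)$ and $\frac{\partial}{\partial\theta}\ln\mathcal{L}(\theta; X_j)$ are uncorrelated for $1 \le i \ne j \le N$.
Finally, taking limits yields the result.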
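The two facts used in this argument, namely that the score has mean zero and that its variance equals minus the expected Hessian, can be checked exactly in a simple case. The sketch below does so for a single Bernoulli observation with log-density $x\ln p + (1-x)\ln(1-p)$, computing expectations by summing over the two possible outcomes. The choice of the Bernoulli model and the function names are assumptions made for illustration only.

```python
def bernoulli_score(x, p):
    """Derivative of ln f(x; p) = x ln p + (1 - x) ln(1 - p) with respect to p."""
    return x / p - (1 - x) / (1 - p)


def bernoulli_hessian(x, p):
    """Second derivative of ln f(x; p) with respect to p."""
    return -x / p ** 2 - (1 - x) / (1 - p) ** 2


def expectation(g, p):
    """Exact expectation of g(X, p) for X ~ Bernoulli(p)."""
    return (1 - p) * g(0, p) + p * g(1, p)


def information_check(p):
    """Return (mean score, variance of score, minus expected Hessian)."""
    mean_score = expectation(bernoulli_score, p)
    var_score = expectation(lambda x, q: bernoulli_score(x, q) ** 2, p) - mean_score ** 2
    minus_hessian = -expectation(bernoulli_hessian, p)
    return mean_score, var_score, minus_hessian
```

For any $0 < p < 1$, the last two entries coincide with $1/[p(1-p)]$, so that the asymptotic variance of the ML estimator is $p(1-p)$ in this model.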

¹ Although we have spoken about joint distributions, we have not explicitly introduced any multivariate
distribution. A multivariate normal distribution depends on a vector of means $\mu$ and a covariance matrix $\Sigma$.
The elements of $\mu$ correspond to the individual means, whereas the elements in the main diagonal of $\Sigma$ refer
to the individual variances. The off-diagonal elements of $\Sigma$ denote the covariances between the components
of the random vector. So, if a random vector $X = (X_1, \ldots, X_k)$ is multivariate normal with a covariance
matrix given by the $k$-dimensional identity matrix $I_k$, then $X_i$ and $X_j$ are independent for any $i \ne j$. In
general, if $X \sim \mathcal{N}(\mu, \Sigma)$, then a vector of linear combinations of the elements of $X$ is also multivariate
normal, namely, $AX \sim \mathcal{N}(A\mu, A\Sigma A')$.

6.2 Interval estimation

It is obviously useful to know the expected value of a random variable, but ideally we must
also have some idea of its variability. For instance, it is always good news to hear that one of
our investments is giving on average 10% of return per month. However, it is even more
comforting to know that it has an expected return of 10% ± 1%. The idea is to derive a
measure of precision from the sampling distribution of the sample mean (or any other point
estimator). In this way, we may provide a range of values within which we expect the mean
of the distribution to belong rather than give only a point estimate.

Examples 

(1) Let $X \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known. We estimate $\mu$ by means of the sample mean $\bar X_N$,
whose sampling distribution is $\bar X_N \sim \mathcal{N}(\mu, \sigma^2/N)$. To establish a confidence interval around
the sample mean in which the true value of $\mu$ belongs with 95% of probability, we first note
that $\sqrt{N}(\bar X_N - \mu)/\sigma$ is standard normal and hence $\Pr\left(\sqrt{N}\,|\bar X_N - \mu|/\sigma < 1.96\right) = 95\%$. It
then follows that

$$\Pr\left(\sqrt{N}\,\frac{|\bar X_N - \mu|}{\sigma} < 1.96\right) = \Pr\left(\bar X_N - 1.96\,\frac{\sigma}{\sqrt{N}} < \mu < \bar X_N + 1.96\,\frac{\sigma}{\sqrt{N}}\right) = 0.95.$$

This result holds exactly only because of normality and because we know the value of the
variance. In large samples, the above confidence interval nevertheless provides a good
approximation because the central limit theorem and the weak law of large numbers ensure that
the sampling distribution converges to a normal distribution, whereas the sample variance
converges in probability to the true variance as $N \to \infty$, respectively.
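The interval in example (1) is straightforward to compute. The following Python sketch is a minimal illustration, assuming the caller supplies the sample mean, the known standard deviation and the sample size; the function name and the `level` parameter are hypothetical. The normal percentile comes from the standard library class `statistics.NormalDist`.

```python
from statistics import NormalDist


def mean_interval_known_sigma(xbar, sigma, n, level=0.95):
    """Two-sided confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level = 0.95
    half_width = z * sigma / n ** 0.5
    return xbar - half_width, xbar + half_width
```

For instance, `mean_interval_known_sigma(10.0, 2.0, 100)` uses a standard error of 2/10 and hence returns an interval of half-width close to 1.96 × 0.2 = 0.392 around 10.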

(2) Let $X = (X_1, \ldots, X_N)$ denote a random sample of Bernoulli trials that take value one
with probability $p$, zero otherwise. To estimate the probability $p$, a natural estimator is the
relative frequency $\hat p_N = \frac{1}{N}\sum_{i=1}^N X_i$, which is unbiased given that

$$\mathbb{E}(\hat p_N) = \frac{1}{N}\sum_{i=1}^N \mathbb{E}(X_i) = \frac{1}{N}\,Np = p.$$

This means that the mean squared error of the relative frequency corresponds to its variance,
which is given by

$$\mathbb{E}(\hat p_N - p)^2 = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^N (X_i - p)\right]^2 = \frac{1}{N}\,\mathbb{E}(X - p)^2 = \frac{1}{N}\,\mathrm{var}(X) = \frac{1}{N}\,p(1-p).$$

Note that the variance shrinks to zero as $N \to \infty$, confirming that the relative frequency is
a consistent estimator in that it converges in probability to the probability $p$. The central
limit theorem says that the binomial converges to a normal distribution as the sample size
grows. Given that $\sum_{i=1}^N X_i$ has by definition a binomial distribution with mean
$Np$ and variance $Np(1-p)$, it follows that $\hat p_N$ is approximately normal in large samples
with mean $p$ and variance $p(1-p)/N$. Accordingly, the probability $p$ belongs to the interval

$$\left[\hat p_N - 1.96\sqrt{\hat p_N(1-\hat p_N)/N},\ \hat p_N + 1.96\sqrt{\hat p_N(1-\hat p_N)/N}\right]$$

with 95% of confidence. As before, we are using not only the central limit theorem to justify
asymptotic normality, but also the law of large numbers to ensure that we estimate the
variance consistently by plugging in $\hat p_N$ in lieu of $p$.

It is interesting to review what happens under normality just to fix some ideas. So,
let $X_i \sim \text{iid } \mathcal{N}(\mu, \sigma^2)$ for $i = 1, \ldots, N$. We know that the sample mean $\bar X_N$ has expected
value $\mu$ and variance $\sigma^2/N$. In addition, as it is a linear combination of normal variates,
$\bar X_N \sim \mathcal{N}(\mu, \sigma^2/N)$ or, equivalently, $\sqrt{N}(\bar X_N - \mu)/\sigma \sim \mathcal{N}(0, 1)$. The snag is that, in general,
the standard deviation $\sigma$ is unknown and hence we can at best replace it with a consistent
estimator. For instance, the maximum likelihood estimator of the variance of a normal
distribution is

$$\hat\sigma_N^2 = \frac{1}{N}\sum_{i=1}^N (X_i - \bar X_N)^2 = \frac{1}{N}\sum_{i=1}^N \left(X_i^2 - 2\bar X_N X_i + \bar X_N^2\right)$$

$$= \frac{1}{N}\sum_{i=1}^N X_i^2 - 2\bar X_N\,\frac{1}{N}\sum_{i=1}^N X_i + \bar X_N^2 = \frac{1}{N}\sum_{i=1}^N X_i^2 - 2\bar X_N^2 + \bar X_N^2$$

$$= \frac{1}{N}\sum_{i=1}^N X_i^2 - \bar X_N^2,$$



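with an expected value of

$$\mathbb{E}(\hat\sigma_N^2) = \mathbb{E}\left(\frac{1}{N}\sum_{i=1}^N X_i^2 - \bar X_N^2\right) = \mathbb{E}(X^2) - \mathbb{E}(\bar X_N^2)$$

$$= \mu^2 + \sigma^2 - \mu^2 - \sigma^2/N = \sigma^2(1 - 1/N) = \frac{N-1}{N}\,\sigma^2 \xrightarrow{N\to\infty} \sigma^2.$$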

Although it is asymptotically unbiased, the ML estimator is biased in small samples. As we
know, to find an unbiased estimator for the variance of a normal distribution, it suffices to
multiply the ML estimator by a factor of $N/(N-1)$, giving way to

$$s_N^2 = \frac{N}{N-1}\,\hat\sigma_N^2 = \frac{1}{N-1}\sum_{i=1}^N (X_i - \bar X_N)^2.$$

Intuitively, we subtract one from the denominator because we have already lost one degree
of freedom due to the estimation of the mean.
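The two variance estimators differ only in the divisor, as the short Python sketch below makes explicit. The function names are ours and serve only as an illustration of the formulas above.

```python
def ml_variance(xs):
    """ML variance estimator: average squared deviation from the sample mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def unbiased_variance(xs):
    """Unbiased estimator: the ML estimator rescaled by N / (N - 1)."""
    n = len(xs)
    return ml_variance(xs) * n / (n - 1)
```

For the sample (2, 4, 4, 4, 5, 5, 7, 9), whose mean is 5 and whose squared deviations add up to 32, the ML estimate is 32/8 = 4, whereas the unbiased estimate is 32/7.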

The nice thing about assuming normality is that it allows us to say something about the
distribution of $s_N^2$ and hence about the exact distribution of the sample mean. To appreciate
that, we first note that

$$s_N^2 = \frac{1}{N-1}\sum_{i=1}^N (X_i - \bar X_N)^2 = \frac{1}{N-1}\sum_{i=1}^N \left[(X_i - \mu) - (\bar X_N - \mu)\right]^2$$

$$= \frac{1}{N-1}\sum_{i=1}^N \left[(X_i - \mu)^2 - 2(X_i - \mu)(\bar X_N - \mu) + (\bar X_N - \mu)^2\right]$$

$$= \frac{1}{N-1}\left[\sum_{i=1}^N (X_i - \mu)^2 - 2\sum_{i=1}^N (X_i - \mu)(\bar X_N - \mu) + \sum_{i=1}^N (\bar X_N - \mu)^2\right]$$

$$= \frac{1}{N-1}\left[\sum_{i=1}^N (X_i - \mu)^2 - 2N(\bar X_N - \mu)^2 + N(\bar X_N - \mu)^2\right]$$

$$= \frac{1}{N-1}\sum_{i=1}^N (X_i - \mu)^2 - \frac{N}{N-1}(\bar X_N - \mu)^2.$$

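Multiplying both sides by $(N-1)$ and dividing by the variance $\sigma^2$ then yields

$$(N-1)\,\frac{s_N^2}{\sigma^2} = \sum_{i=1}^N \left(\frac{X_i - \mu}{\sigma}\right)^2 - N\left(\frac{\bar X_N - \mu}{\sigma}\right)^2 = \sum_{i=1}^N \left[\frac{X_i - \mu}{\sigma}\right]^2 - \left[\frac{\bar X_N - \mu}{\sigma/\sqrt{N}}\right]^2.$$

Each term within brackets is a standard normal random variable, so that its square has a
chi-square distribution with one degree of freedom. Moreover, it is possible to show that
$(N-1)\,s_N^2/\sigma^2$ is a chi-square random variable with $N-1$ degrees of freedom. Bearing that
in mind, we now turn attention to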

$$\sqrt{N}\,\frac{\bar X_N - \mu}{s_N} = \frac{\sqrt{N}(\bar X_N - \mu)/\sigma}{s_N/\sigma},$$

whose distribution is Student's t with $N-1$ degrees of freedom given that the numerator is
standard normal and the denominator is the square root of a chi-square divided by its degrees
of freedom.² Altogether, this means that, under normality, we can construct an exact confidence
interval for the mean value of the distribution. In particular, it follows that

$$\Pr\left(\left|\sqrt{N}\,\frac{\bar X_N - \mu}{s_N}\right| < t_{N-1}(1 - \alpha/2)\right) = 1 - \alpha,$$

where $t_{N-1}(1 - \alpha/2)$ is the $(1 - \alpha/2)$ percentile of a Student's t with $N-1$ degrees of freedom.
For instance, to construct a 95% confidence interval, we set $\alpha$ to 5% and hence we must look
at the 97.5% percentile of the Student's t. As the latter distribution is symmetric, the 97.5%
percentile is equal to the absolute value of the 2.5% percentile, so that $\mu$ will belong to the
interval $\left[\bar X_N + t_{N-1}(\alpha/2)\,s_N/\sqrt{N},\ \bar X_N + t_{N-1}(1 - \alpha/2)\,s_N/\sqrt{N}\right]$ with a probability of 95%.

² It is also possible to show that, under normality, the random variables in the numerator and denominator
are independent.
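A minimal Python sketch of the exact interval follows. Because the Python standard library offers no Student's t percentile function, the critical value $t_{N-1}(1 - \alpha/2)$ is passed in by the caller, e.g., read from a t table; the function name and this design choice are assumptions made for illustration.

```python
def t_interval(xs, t_crit):
    """Confidence interval for the mean based on the Student's t distribution.

    t_crit is the (1 - alpha/2) percentile with len(xs) - 1 degrees of freedom,
    supplied by the caller (for instance, read from a t table).
    """
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)  # unbiased variance estimator
    half_width = t_crit * (s2 / n) ** 0.5
    return mean - half_width, mean + half_width
```

With eight observations, a 95% interval would use the 97.5% percentile with seven degrees of freedom, which standard tables report as approximately 2.365.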
If normality does not hold, then we must in general employ the central limit theorem and
the law of large numbers to justify the asymptotic approximation of the confidence interval
based on the normal distribution. Let $X_i \sim \text{iid } F_X(\mu, \sigma^2)$ for $i = 1, \ldots, N$. Standardizing
the sample mean then yields

$$\sqrt{N}\,\frac{\bar X_N - \mu}{s_N} = \sqrt{N}\,\frac{\bar X_N - \mu}{\sigma}\,\frac{\sigma}{s_N} \xrightarrow{d} \mathcal{N}(0, 1)$$

given that the central limit theorem dictates that the first term on the right-hand side of the
equality is asymptotically standard normal and consistency ensures that the second term
converges in probability to one as $N \to \infty$. Actually, the same result follows for any other
consistent variance estimator. This means that $\mu \in \left[\bar X_N + z_{\alpha/2}\,s_N/\sqrt{N},\ \bar X_N + z_{1-\alpha/2}\,s_N/\sqrt{N}\right]$
with probability approximately equal to $1 - \alpha$ in large samples, where $z_{\alpha/2}$ and $z_{1-\alpha/2}$ are the
$\alpha/2$ and $(1 - \alpha/2)$ percentiles of the standard normal distribution. As before, due to the
symmetry of the normal distribution, it turns out that $z_\delta = -z_{1-\delta}$ for any $0 < \delta < 1$.

Chapter 7
Hypothesis testing



We know from previous chapters that, whenever we try to infer a quantity, the resulting 
estimate varies with the sample. The random nature of any sample statistic is such that 
we should always think twice before interpreting a result. For instance, we might wonder 
whether a sample mean of 11 confirms or not a hypothesized value of 10 for the population 
mean. The answer of course lies in the typical variation of the data. If the standard deviation
is very small, say 0.01, then a sample mean of 11 is pretty far away from 10. We probably
would conclude differently if the standard deviation were large, say 5. Building confidence
intervals is just a first step toward answering this type of question. Significance (or hypothesis)
testing provides a much more general tool for this sort of task. In particular, it allows us to 
check on how strong the statistical evidence is in favor of or against a hypothesis about the 
data (e.g., whether the true mean is 10 given that the sample mean is 11). 

Significance testing starts with a partition of the probability space into two regions: the
null hypothesis $H_0$ and the alternative hypothesis $H_1$. The former consists of all events for
which the relation of interest holds, whereas the alternative hypothesis is simply the negation
of the null. The idea is to observe the data seeking evidence against the null hypothesis.
Note that a statistical test can at best contradict the null hypothesis. However, failing to
reject the null hypothesis does not necessarily mean that we should accept it, just that we do
not have enough material to reject it. The testing strategy then is to develop a statistic that
should reflect the relation of interest as hypothesized by the null. This means that we will
have to derive the sampling distribution of some test statistic conditioning on the fact that
the null holds. To make this task easier, we normally define the null as the simplest case. For
instance, it is much easier to compute the distribution for the sample mean by setting the
population mean to 10 rather than considering any value different from 10. In general, the
alternative hypothesis typically reflects a change/impact in the process/population, while
the null hypothesis indicates the absence of a change or impact.

Examples 

(1) Suppose we wish to test whether the true population mean $\mu$ exceeds 15. We can then
define the null hypothesis as $H_0: \mu \ge 15$ and the alternative hypothesis as $H_1: \mu < 15$.
This is a directional test given that we are attempting to evince deviations with respect to a
particular direction. We could thus consider a statistical test that rejects the null if the sample
mean belongs to a given interval. Intuitively, to determine the latter, we should appreciate that
it does not suffice to observe a sample mean below 15 to reject the null. The reason is simple.
As a consistent estimator, the sample mean should provide us a value in the neighborhood
of the true mean. If the latter is 15.1, for instance, then it is likely that we will observe a
sample mean below 15 even though the null hypothesis holds.

(2) Suppose now the interest lies in testing whether $\mu = 15$. We then define the null
hypothesis as $H_0: \mu = 15$ given that it is easier to derive the distribution of the sample mean
if we know the true value of the population mean. In contrast to the previous example, testing
$H_0$ involves no direction in that we should observe both positive and negative deviations
with respect to 15 (instead of only negative) in our attempt to reject the null hypothesis. As
before, because of sampling variation, we should consider a test in which we reject the null
if the sample mean is either too large or too small relative to 15. Needless to say, we should
define 'too large' and 'too small' according to the distribution of the sample mean just as we
have done for confidence intervals in the previous chapter.

To understand whether the value we observe for a sample statistic is plausible or not,
given the amount of randomness we specify in the null hypothesis, we must hinge our analysis
on the distribution of that sample statistic under the null. For instance, most people would
not reject the null hypotheses in the above examples if the sample mean were 14.9999, though
most would reject the null in the second example if observing a sample mean of 500. As
we mention in the second example, to find the appropriate rejection region, we must follow
a procedure very similar to that we have used in the previous chapter to obtain confidence
intervals. Indeed, the first step in the derivation of a statistical testing procedure is to obtain
a rejection region by fixing a significance level for the test. Just as a confidence interval of
95% will on average miss the true value of the parameter 5% of the time, a test at the 5%
significance level will incorrectly reject the null hypothesis at most 5% of the time.

The main difference between confidence intervals and tests is that there is only one type
of error in the former, i.e., missing the true value of the parameter. In contrast, there are
two sources of errors within hypothesis testing. We can either commit an error of type I or
an error of type II. The first arises if we reject the null hypothesis even though it is true,
whereas the second refers to the event of failing to reject a false null. To sum up, if we denote
by $R$ the event of rejecting the null hypothesis $H_0$, then $\Pr(\text{type I error}) = \Pr(R \mid H_0 \text{ is true})$
and $\Pr(\text{type II error}) = \Pr(\bar R \mid H_0 \text{ is false})$ with $\bar R$ denoting the complement event of $R$ and
hence a non-rejection of the null.

Note that we treat these errors in a very asymmetric fashion in that we fix only the
maximum tolerable probability of a type I error, e.g., $\Pr(\text{type I error}) \le \alpha$ if the significance
level is $\alpha$. It turns out that it is impossible to control both errors at the
same time and hence the best we can do is to find a statistical procedure that minimizes the
chances of an error of type II given a fixed probability of an error of type I. Alternatively, we
could think of doing the contrary, that is to say, fixing the probability of a type II error and
then obtaining the testing procedure that minimizes the likelihood of a type I error. Although
there is no logical reason to do so (i.e., fixing the error of type I), there is a moral reason
(who said morals do not play a role in science?) as advocated by two of the most eminent
statisticians of all time, namely, Jerzy Neyman (1894-1981) and Egon Pearson (1895-1980).
The motivation for their idea of fixing the type I error rests on a jury trial, in which the null
hypothesis is that the defendant is innocent. Most people would agree with Neyman and Pearson
that it is preferable to free a guilty criminal than to put an innocent person in jail. We thus
fix the probability of committing a type I error for it is
more damaging than the type II error.
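The trade-off between the two errors can be made concrete with a small Python sketch. It considers a right-tailed test on the mean of a normal sample with known standard deviation: the cutoff is chosen so that the type I error probability equals $\alpha$, and the type II error probability is then evaluated at one particular value of the mean under the alternative. The setup and all names are assumptions made for illustration; the numbers in the usage note echo the chapter's opening illustration (hypothesized mean of 10, standard deviation of 5) together with an assumed sample size of 100.

```python
from statistics import NormalDist


def error_probabilities(mu0, mu1, sigma, n, alpha):
    """Type I and type II error probabilities of a right-tailed test on a normal mean.

    The test rejects H0: mu = mu0 when the sample mean exceeds a cutoff chosen
    so that the type I error probability equals alpha; mu1 > mu0 is one
    particular value of the mean under the alternative.
    """
    se = sigma / n ** 0.5
    cutoff = mu0 + NormalDist().inv_cdf(1.0 - alpha) * se
    type_one = 1.0 - NormalDist(mu0, se).cdf(cutoff)  # Pr(reject | H0 true)
    type_two = NormalDist(mu1, se).cdf(cutoff)        # Pr(no rejection | mu = mu1)
    return cutoff, type_one, type_two
```

Comparing `error_probabilities(10, 11, 5, 100, 0.05)` with `error_probabilities(10, 11, 5, 100, 0.01)` shows that lowering the tolerated type I error probability raises the cutoff and hence increases the type II error probability.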

Example: We may see a pregnancy test as a statistical procedure that decides whether
there is enough evidence supporting pregnancy. As a statistical procedure can at best reject
the null hypothesis, this means that the null of a pregnancy test is actually the absence of
pregnancy. Most people would agree that a false negative is potentially more damaging than
a false positive. This is in line with the Neyman-Pearson solution in that a false positive
corresponds to a type I error (rejecting a true null), whereas a false negative refers to a type
II error (failing to reject a false null).

In what follows, we first show how to derive the rejection region for sample means. We 
then introduce the concepts of size, level and power of a test, which derive from the type 
I and type II errors. Next, we introduce the notion of p-value, which somehow gauges the 
strength of the evidence against the null hypothesis. Computing p-values is an alternative 
to setting ex-ante the significance level of the test and hence a rejection region. Finally,
in the last section, we discuss hypothesis testing in a more general fashion by assuming a
likelihood approach.

7.1 Rejection region for sample means 

We motivate this section with a simple example. A pharmaceutical lab is running clinical
trials to assess whether a new medicine to control the levels of cholesterol indeed works
better than the current medicine in the market. The clinical trials consider two groups of
100 patients. Group A takes the new medicine, whereas group B is subject to the standard
treatment. To evaluate the relative performance of the new medicine, the lab measures
the difference in the cholesterol decrease between groups A and B (in percentage points).
The null hypothesis of interest is that, on average, there is no difference between the two
treatments: $H_0: \mu_{A-B} = 0$.

The most natural estimator for the mean difference between the results of groups A and B
is the difference between their sample means (or, equivalently, the sample mean difference).
We must now ask ourselves whether a sample difference of, say, 1.06 is plausible given a
standard deviation of, say, 2 under the null of zero mean difference. The idea is very similar
to what we do if we wish to construct a confidence interval. The central limit theorem
ensures that we can approximate the sampling distribution of the sample mean difference in
large samples by a standard Gaussian distribution if we subtract its mean and divide by its
standard deviation. The null hypothesis says that the mean difference is zero, while saying
nothing about the standard deviation. The standard error of the sample mean difference is
the standard deviation divided by the square root of the sample size, so that it amounts to
$2/\sqrt{100} = 1/5$. Letting $z_q$ denote the $q$th percentile of a standard normal distribution then yields



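$$\Pr\left(\left|\frac{\hat\mu_{A-B} - 0}{1/5}\right| > z_{1-\alpha/2}\ \Big|\ H_0\right) = \Pr\left(|\hat\mu_{A-B}| > \tfrac{1}{5}\,z_{1-\alpha/2}\ \Big|\ H_0\right) = \alpha.$$

This suggests rejecting at the 5% significance level if $|\hat\mu_{A-B}| > \frac{1}{5}\,z_{0.975} = 1.96/5 = 0.392$. The
sample mean difference of 1.06 is obviously greater than 0.392 and hence we conclude that
the two treatments entail different performances.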
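The arithmetic of this example can be reproduced with a few lines of Python. The sketch below is only an illustration: the function name and its return convention are ours, while the inputs (a difference of 1.06, a standard deviation of 2 and a sample size of 100) come from the example.

```python
from statistics import NormalDist


def two_sided_test(estimate, null_value, std_error, alpha=0.05):
    """Two-sided z test: return (critical distance, z statistic, reject?)."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    critical_distance = z_crit * std_error
    z_stat = (estimate - null_value) / std_error
    return critical_distance, z_stat, abs(estimate - null_value) > critical_distance


# Numbers from the clinical-trial example: difference 1.06, standard error 2/sqrt(100).
critical_distance, z_stat, reject = two_sided_test(1.06, 0.0, 2.0 / 100 ** 0.5)
```

Here the critical distance is about 0.392 and the standardized difference is 1.06/0.2 = 5.3, so that the null of no difference is rejected at the 5% level.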





The above test is two-sided in that it checks whether either positive or negative deviations
are large. The fact that the sample mean difference is positive also indicates that the new
medicine performs relatively better. To confirm that, we must test a directional null hypothesis
by means of one-sided tests. So, if we change the null to $H_0: \mu_{A-B} \ge 0$, only large negative
deviations will contradict the null hypothesis. It then follows from $\Pr\left(\hat\mu_{A-B} < \frac{1}{5}\,z_\alpha \mid H_0\right) = \alpha$
that the rejection region is $(-\infty, \frac{1}{5}\,z_\alpha]$. Note that the $\alpha$th percentile of the standard normal
distribution is negative ($z_\alpha < 0$) for any $\alpha < 1/2$. In contrast, if we define the null hypothesis
as $H_0: \mu_{A-B} \le 0$, then we would have to worry only about large positive sample mean
differences, yielding $[\frac{1}{5}\,z_{1-\alpha}, \infty)$ as a rejection region given that $\Pr\left(\hat\mu_{A-B} > \frac{1}{5}\,z_{1-\alpha} \mid H_0\right) = \alpha$.
Figure 7.1 illustrates the main standard normal percentiles for testing purposes.
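For the one-sided versions of the clinical-trial test, the cutoffs follow from the $\alpha$th percentile of the standard normal distribution. The Python sketch below computes both cutoffs; the function name and its default arguments are hypothetical and serve only to illustrate the rejection regions above.

```python
from statistics import NormalDist


def one_sided_cutoffs(std_error, alpha=0.05, null_value=0.0):
    """Cutoffs for the two directional nulls.

    Returns (lower, upper): reject H0: mu >= null_value when the estimate is
    at or below `lower`, and reject H0: mu <= null_value when it is at or
    above `upper`.
    """
    z_alpha = NormalDist().inv_cdf(alpha)  # negative for alpha < 1/2
    lower = null_value + z_alpha * std_error
    upper = null_value - z_alpha * std_error  # uses z_{1-alpha} = -z_alpha
    return lower, upper
```

With a standard error of 1/5 and a 5% level, the cutoffs are about -0.329 and 0.329, so that a sample mean difference of 1.06 leads to a rejection of the null that the new medicine performs no better.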

Let us now consider the general two-sided case in which we wish to test the null that the
population mean of a random sample $X = (X_1, \ldots, X_N)$ is equal to $\mu_0$. In what follows, we
consider three different setups. The first and second settings respectively assume normality
with known and unknown variances, whereas the third imposes no specific distribution for
the data. Regardless of the data distribution, the sample mean is on average equal to $\mu_0$
under the null hypothesis, with variance $\sigma^2/N$.

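Normality, known variance: If $X_i \sim \text{iid } \mathcal{N}(\mu, \sigma^2)$ with a known variance $\sigma^2$, it follows
that $\frac{\bar X_N - \mu_0}{\sigma/\sqrt{N}}$ is standard normal under the null and hence the rejection region at the $\alpha$ significance level is
$(-\infty, \mu_0 + \frac{\sigma}{\sqrt{N}}\,z_{\alpha/2}] \cup [\mu_0 + \frac{\sigma}{\sqrt{N}}\,z_{1-\alpha/2}, \infty)$. Note that the above intervals are symmetric in
that $z_{1-\alpha/2} = -z_{\alpha/2}$ for any $0 < \alpha < 1$.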

Normality, unknown variance: If $X_i \sim$ iid $N(\mu, \sigma^2)$ with an unknown variance $\sigma^2$,
it follows that $\sqrt{N}\,(\bar{X}_N - \mu_0)/s_N$ is t-student with $N - 1$ degrees of freedom under the null.
The rejection region at the $\alpha$ significance level is
$(-\infty, \mu_0 + \frac{s_N}{\sqrt{N}} t_{N-1}^{\alpha/2}] \cup [\mu_0 + \frac{s_N}{\sqrt{N}} t_{N-1}^{1-\alpha/2}, \infty)$, where $t_d^q$ denotes the
$q$th percentile of the t-student with $d$ degrees of freedom. Note that the above intervals
are symmetric given that $t_d^{1-\alpha/2} = -t_d^{\alpha/2}$ for any $0 < \alpha < 1$ regardless of the number of
degrees of freedom.
Figure 7.1 (axis marks at $-2.57$, $-1.96$ and $-1.645$): The area in yellow corresponds to 1%
of the probability mass of a standard normal distribution (i.e., 0.5% in the lower tail plus
0.5% in the upper tail), whereas the area in yellow and blue accounts for 5% of the probability
mass (i.e., 2.5% in the lower tail plus 2.5% in the upper tail). Finally, the area in pink
entails another 2.5% of the probability mass in each of the tails, so that the area in yellow,
blue and pink amounts to 10% of the probability mass of a standard normal distribution.



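Unknown distribution: If $X_i \sim$ iid $f_X(\mu, \sigma^2)$ with an unknown variance $\sigma^2$, it follows
from the central limit theorem that $\sqrt{N}\,(\bar{X}_N - \mu_0)/\hat{\sigma}_N$ converges in distribution to a standard
normal as the sample size grows. The asymptotic rejection region at the $\alpha$ significance level then is

$$(-\infty,\ \mu_0 + \tfrac{\hat{\sigma}_N}{\sqrt{N}}\, z_{\alpha/2}] \;\cup\; [\mu_0 + \tfrac{\hat{\sigma}_N}{\sqrt{N}}\, z_{1-\alpha/2},\ \infty).$$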

Adapting the above rejection intervals to one-sided tests is pretty straightforward. It
suffices to take the interval of interest and replace $\alpha/2$ by $\alpha$ in the percentile of the sampling
distribution. For instance, under normality and unknown variance, the rejection region for
the one-sided test for $H_0: \mu \leq \mu_0$ against $H_1: \mu > \mu_0$ is $[\mu_0 + \frac{s_N}{\sqrt{N}} t_{N-1}^{1-\alpha}, \infty)$, whereas we
would reject $H_0: \mu \geq \mu_0$ if the sample mean falls within $(-\infty, \mu_0 + \frac{s_N}{\sqrt{N}} t_{N-1}^{\alpha}]$. The critical
value that defines the rejection region for the one-sided test depends on $\mu_0$ just as in the
two-sided test, despite the fact that we define the null hypothesis through an inequality rather
than an equality. By conditioning the test statistic on $\mu_0$, we are essentially taking the least
favorable situation within the alternative hypothesis, that is to say, the value of $\mu \in H_1$ that
is closest to the null hypothesis. It turns out that considering the least favorable situation
within the alternative hypothesis alleviates the probability of committing a type II error.

Example: A barista prepares a sequence of 16 espressos, taking note of how much time it
takes to pour the best possible espresso. As the sample mean time amounts to 26 seconds,
the barista concludes that the espresso machine is too fast and decides to fine-tune it in order
to increase the preparation time by 2 seconds. The quality-control manager disagrees with
the barista for the following reasons. First, the sample size is too small. Second, the sample
is not entirely random given that the barista prepares the espressos in a straight sequence.
Third, the barista is not accounting for the randomness of the data. The sample standard
deviation is indeed quite sizable, at 6 seconds. Fourth, it could be more interesting to
fine-tune the espresso machine so as to reduce the variability of the preparation time rather
than to increase the mean time. To substantiate her argument, the quality-control manager
tests whether the mean time is equal to 28 within a random sample context. The difference
between the hypothesized and sample means is equal to 2 seconds. The following
large-sample approximation then holds
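$$\begin{aligned}
\Pr\left( \left| \sqrt{16}\, \frac{\bar{X}_{16} - 28}{6} \right| > z_{1-\alpha/2} \,\Big|\, H_0: \mu = 28 \right)
  &\cong 1 - \Phi(z_{1-\alpha/2}) + \Phi(-z_{1-\alpha/2}) \\
  &= 1 - \Phi(z_{1-\alpha/2}) + \Phi(z_{\alpha/2}) \\
  &= 1 - (1 - \alpha/2) + \alpha/2 = \alpha,
\end{aligned}$$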

where $\Phi(\cdot)$ denotes the cumulative distribution function of a standard normal. Setting $\alpha$ to
5% then yields a test that rejects the null $H_0: \mu = 28$ if the sample mean does not belong
to the interval $[28 \pm \tfrac{3}{2} \times 1.96] = [25.06, 30.94]$. This is not the case here as the sample mean is
26 seconds and so there is no statistical reason to believe that the espresso machine requires
adjustment.
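
The barista calculation can be checked numerically. The following is a minimal sketch using only Python's standard library; the function name `two_sided_acceptance_interval` is hypothetical and introduced here for illustration, and the sample standard deviation is plugged in as if it were the true one, as in the large-sample approximation above.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_acceptance_interval(mu0, sd, n, alpha=0.05):
    """Interval of sample means for which H0: mu = mu0 is NOT rejected."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1-alpha/2}, about 1.96 for alpha = 5%
    half_width = z * sd / sqrt(n)
    return mu0 - half_width, mu0 + half_width


# Barista example: N = 16 espressos, standard deviation 6 s, hypothesized mean 28 s.
lower, upper = two_sided_acceptance_interval(mu0=28, sd=6, n=16)
sample_mean = 26
reject = not (lower <= sample_mean <= upper)
print(f"acceptance interval: [{lower:.2f}, {upper:.2f}]; reject H0: {reject}")
```

Since 26 lies inside the interval, the sketch reaches the same conclusion as the text: the null of a 28-second mean is not rejected at the 5% level.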
7.2 Size, level, and power of a test

In this section, we extend the discussion to a more general setting in which we are interested
in a parameter $\theta$ of the distribution (not necessarily the mean). As before, the derivation of
a testing procedure involves two major steps. The first is to obtain a test statistic that is
able to distinguish the null from the alternative hypothesis. For instance, if we are interested
in the arrival rate of a Poisson distribution, it is then natural to focus either on the sample
mean or on the sample variance.¹ The second is to derive the rejection region for the test
statistic. The rejection region depends of course on the level of significance $\alpha$, which denotes
the upper limit for the probability of committing a type I error. A similar concept is given by
the (exact/asymptotic) size of a test, which corresponds to the (exact/limiting) probability
of observing a type I error.

In general, we are only able to compute the size of a test if both null and alternative
hypotheses are simple, that is to say, they involve only one value for the parameter vector,
as in $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$. Unfortunately, most situations refer to at least one
composite hypothesis, e.g., $H_0: \theta = \theta_0$ against $H_1: \theta < \theta_0$, or $H_0: \theta = \theta_0$ against $H_1: \theta > \theta_0$,
or $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$, or $H_0: \theta \geq \theta_0$ against $H_1: \theta < \theta_0$, or $H_0: \theta \leq \theta_0$ against
$H_1: \theta > \theta_0$. Note that it does not make much sense to think about a situation in which the
null hypothesis is composite and the alternative is simple. It is always easier to derive the
distribution of the test statistic for a given value of the parameter (rather than an interval),
and so it would pay off to invert the hypotheses.

Both level and size relate to the type I error. To make it fair, we will now define a
concept that derives from the probability of committing a type II error. The power of a test
is the probability of correctly rejecting the null hypothesis, namely,

$$\Pr(R \mid H_0 \text{ is false}) = 1 - \Pr(\bar{R} \mid H_0 \text{ is false}) = 1 - \Pr(\text{type II error}),$$

where $R$ denotes the event that the test rejects the null and $\bar{R}$ its complement.

So, we should attempt to obtain the most powerful test possible if we wish to minimize
the likelihood of having a type II error. In general, the power of a test is a function of the
value of the parameter vector under the alternative. The power function degenerates to a
constant only in the event of a simple alternative hypothesis, viz. $H_1: \theta = \theta_1$. To work out
the logic of the derivation of the power, let's revisit the barista example from the previous
section.

1 Recall that if $X$ is a Poisson with arrival rate $\lambda$, then $E(X) = \mathrm{var}(X) = \lambda$.

Example: Suppose that it actually takes on average 24 seconds to pour a perfect
espresso. In the previous section, we have computed a large-sample approximation under
the null for the distribution of the sample mean. We now derive the asymptotic power of
the means test at the $\alpha = 5\%$ level of significance conditioning on $\mu = 24$.
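The probability of falling into the rejection region is

$$\begin{aligned}
\Pr\left( \left| \sqrt{16}\, \frac{\bar{X}_{16} - 28}{6} \right| > 1.96 \,\Big|\, \mu = 24 \right)
  &= 1 - \Pr\left( \left| \sqrt{16}\, \frac{\bar{X}_{16} - 28}{6} \right| < 1.96 \,\Big|\, \mu = 24 \right) \\
  &= 1 - \Pr\left( 28 - 2.94 < \bar{X}_{16} < 28 + 2.94 \mid \mu = 24 \right) \\
  &= 1 - \Pr\left( 25.06 < \bar{X}_{16} < 30.94 \mid \mu = 24 \right) \\
  &= 1 - \Pr\left( \sqrt{16}\, \frac{25.06 - 24}{6} < \sqrt{16}\, \frac{\bar{X}_{16} - 24}{6} < \sqrt{16}\, \frac{30.94 - 24}{6} \,\Big|\, \mu = 24 \right) \\
  &\cong 1 - \Phi\left( \frac{6.94}{3/2} \right) + \Phi\left( \frac{1.06}{3/2} \right) \\
  &= 1 - 0.999998142 + 0.760113176 = 0.760115034.
\end{aligned}$$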

Note that this power figure holds only asymptotically for we are taking the normal approxi- 
mation for the unknown distribution of the sample mean. 

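In general, to compute the (asymptotic) power function of a two-sided means test, it
suffices to appreciate that the probability of rejecting the null for $\mu = \mu_1 \neq \mu_0$ is

$$\begin{aligned}
\Pr\left( \left| \sqrt{N}\, \frac{\bar{X}_N - \mu_0}{\hat{\sigma}_N} \right| > z_{1-\alpha/2} \,\Big|\, \mu = \mu_1 \right)
  &= 1 - \Pr\left( \left| \sqrt{N}\, \frac{\bar{X}_N - \mu_0}{\hat{\sigma}_N} \right| < z_{1-\alpha/2} \,\Big|\, \mu = \mu_1 \right) \\
  &= 1 - \Pr\left( -z_{1-\alpha/2} < \sqrt{N}\, \frac{\bar{X}_N - \mu_0}{\hat{\sigma}_N} < z_{1-\alpha/2} \,\Big|\, \mu = \mu_1 \right) \\
  &= 1 - \Pr\left( \mu_0 - z_{1-\alpha/2} \frac{\hat{\sigma}_N}{\sqrt{N}} < \bar{X}_N < \mu_0 + z_{1-\alpha/2} \frac{\hat{\sigma}_N}{\sqrt{N}} \,\Big|\, \mu = \mu_1 \right) \\
  &= 1 - \Pr\left( \mu_0 - \mu_1 - z_{1-\alpha/2} \frac{\hat{\sigma}_N}{\sqrt{N}} < \bar{X}_N - \mu_1 < \mu_0 - \mu_1 + z_{1-\alpha/2} \frac{\hat{\sigma}_N}{\sqrt{N}} \,\Big|\, \mu = \mu_1 \right) \\
  &= 1 - \Pr\left( \sqrt{N}\, \frac{\mu_0 - \mu_1}{\hat{\sigma}_N} - z_{1-\alpha/2} < \sqrt{N}\, \frac{\bar{X}_N - \mu_1}{\hat{\sigma}_N} < \sqrt{N}\, \frac{\mu_0 - \mu_1}{\hat{\sigma}_N} + z_{1-\alpha/2} \,\Big|\, \mu = \mu_1 \right) \\
  &\cong 1 - \Phi\left( \sqrt{N}\, \frac{\mu_0 - \mu_1}{\hat{\sigma}_N} + z_{1-\alpha/2} \right) + \Phi\left( \sqrt{N}\, \frac{\mu_0 - \mu_1}{\hat{\sigma}_N} - z_{1-\alpha/2} \right),
\end{aligned}$$

where $\hat{\sigma}_N$ denotes the sample standard deviation. Note that the power function converges
to one as the sample size increases provided that $\mu_0 \neq \mu_1$ because both cumulative
distribution functions converge to the same value (namely, zero if $\mu_0 < \mu_1$ and one if
$\mu_0 > \mu_1$).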
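
The last line of the derivation translates directly into a few lines of code. The sketch below is illustrative only: the function name `two_sided_power` is an assumption, and the standard deviation is treated as a known plug-in value. It evaluates the asymptotic power at the barista figures and at larger hypothetical sample sizes to show the convergence to one.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_power(mu0, mu1, sd, n, alpha=0.05):
    """Asymptotic power of the two-sided means test of H0: mu = mu0 when mu = mu1."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)   # z_{1-alpha/2}
    shift = sqrt(n) * (mu0 - mu1) / sd      # sqrt(N) (mu0 - mu1) / sigma
    return 1 - std_normal.cdf(shift + z) + std_normal.cdf(shift - z)


# Barista example: null mean 28, true mean 24, standard deviation 6, N = 16.
barista_power = two_sided_power(mu0=28, mu1=24, sd=6, n=16)

# Same alternative with larger (hypothetical) sample sizes: power approaches one.
powers_by_n = {n: two_sided_power(28, 24, 6, n) for n in (16, 64, 256)}
print(round(barista_power, 4), powers_by_n)
```

Evaluating the function at $\mu_1 = \mu_0$ returns the significance level itself, which is a quick sanity check on the formula.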
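It is straightforward to deal with one-sided tests as well. For instance, for a means
test of $H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$, the test statistic is $\frac{\bar{X}_N - \mu_0}{\hat{\sigma}_N/\sqrt{N}}$ with an asymptotic
critical value given by the $(1 - \alpha)$th percentile of the standard normal distribution given
that $\Pr\left( \frac{\bar{X}_N - \mu_0}{\hat{\sigma}_N/\sqrt{N}} > z_{1-\alpha} \right) = \alpha$ under the null hypothesis. Letting $\mu_1 > \mu_0$ denote a mean
value under the alternative yields a power of

$$\begin{aligned}
\Pr\left( \frac{\bar{X}_N - \mu_0}{\hat{\sigma}_N/\sqrt{N}} > z_{1-\alpha} \,\Big|\, \mu = \mu_1 \right)
  &= 1 - \Pr\left( \bar{X}_N < \mu_0 + z_{1-\alpha} \frac{\hat{\sigma}_N}{\sqrt{N}} \,\Big|\, \mu = \mu_1 \right) \\
  &= 1 - \Pr\left( \bar{X}_N - \mu_1 < \mu_0 - \mu_1 + z_{1-\alpha} \frac{\hat{\sigma}_N}{\sqrt{N}} \,\Big|\, \mu = \mu_1 \right) \\
  &= 1 - \Pr\left( \frac{\bar{X}_N - \mu_1}{\hat{\sigma}_N/\sqrt{N}} < \frac{\mu_0 - \mu_1}{\hat{\sigma}_N/\sqrt{N}} + z_{1-\alpha} \,\Big|\, \mu = \mu_1 \right) \\
  &\cong 1 - \Phi\left( \sqrt{N}\, \frac{\mu_0 - \mu_1}{\hat{\sigma}_N} + z_{1-\alpha} \right).
\end{aligned}$$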

As before, power converges to one as the sample size increases. This property is known 
as consistency. We say a test is consistent if it has asymptotic unit power for any fixed 
alternative. 

In the previous chapter, we have seen that it is typically very difficult to obtain efficient 
estimators if we do not restrict attention to a specific class (e.g., class of unbiased estimators). 
The same problem arises if we wish to derive a uniformly most powerful test at a certain 
significance level. Unless we confine attention to simple null and alternative hypotheses, it 
is not possible to derive optimal tests without imposing further restrictions. To appreciate 
why, it suffices to imagine a situation in which we wish to test $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$.
It is easy to see that the one-sided test for $H_0: \theta = \theta_0$ against $H_1: \theta > \theta_0$ is more powerful
than the two-sided test if $\theta = \theta_1 > \theta_0$, just as the one-sided test for $H_0: \theta = \theta_0$ against
$H_1: \theta < \theta_0$ is more powerful than the two-sided test if $\theta = \theta_1 < \theta_0$.

Figure 7.2 illustrates this fact by plotting the power functions of one-sided tests for
$H_0: \theta = \theta_0$ against either $H_1: \theta < \theta_0$ or $H_1: \theta > \theta_0$ at the $\alpha$ and $\alpha/2$ levels of significance.
The power of the one-sided tests is inferior to their levels of significance for values of $\theta$ that
strongly contradict the alternative hypothesis (e.g., large positive values for $H_1: \theta < \theta_0$).
This is natural, though not acceptable for a test of $H_0: \theta = \theta_0$, because these tests are not
designed to look at deviations from the null in both directions. That's exactly why we prefer
to restrict attention to unbiased tests, that is to say, tests whose power never falls below
their size. Applying such a criterion to the above situation clarifies why most people would
prefer the two-sided test instead of one of the one-sided tests. To obtain the power function
of a two-sided test of $H_0: \theta = \theta_0$, it suffices to sum up the power functions of the one-sided
tests at the $\alpha/2$ significance level against $H_1: \theta > \theta_0$ and $H_1: \theta < \theta_0$.

Figure 7.2: Power functions of one-sided tests for $H_0: \theta = \theta_0$ (power plotted against $\theta$ for
$H_1: \theta > \theta_0$ at levels $\alpha$ and $\alpha/2$, and for $H_1: \theta < \theta_0$ at levels $\alpha$ and $\alpha/2$).
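
To make the comparison concrete, here is a minimal sketch for the asymptotic means test with a plug-in standard deviation. The function names and the numerical settings (null mean 0, unit standard deviation, 25 observations) are hypothetical choices for illustration, not values taken from the text.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def upper_power(mu0, mu1, sd, n, alpha):
    """Power of the one-sided test against H1: mu > mu0 when the true mean is mu1."""
    shift = sqrt(n) * (mu0 - mu1) / sd
    return 1 - STD_NORMAL.cdf(shift + STD_NORMAL.inv_cdf(1 - alpha))


def lower_power(mu0, mu1, sd, n, alpha):
    """Power of the one-sided test against H1: mu < mu0 when the true mean is mu1."""
    shift = sqrt(n) * (mu0 - mu1) / sd
    return STD_NORMAL.cdf(shift + STD_NORMAL.inv_cdf(alpha))


def two_sided_power(mu0, mu1, sd, n, alpha):
    """Sum of the two one-sided power functions, each at level alpha / 2."""
    return (upper_power(mu0, mu1, sd, n, alpha / 2)
            + lower_power(mu0, mu1, sd, n, alpha / 2))


for mu1 in (-0.5, 0.0, 0.5):
    print(mu1,
          round(upper_power(0, mu1, 1, 25, 0.05), 4),
          round(two_sided_power(0, mu1, 1, 25, 0.05), 4))
```

For a true mean above the null, the upper one-sided test beats the two-sided test; for a true mean below the null, its power drops below the 5% level, which is precisely the bias that the two-sided test avoids.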
7.3 Interpreting p-values

The Neyman-Pearson paradigm leads to a dichotomy in the context of hypothesis testing 
in that we can either reject or not the null hypothesis given a certain significance level. 
We would expect however that there are rejections and rejections. How far a test statistic 
extends into the rejection region should intuitively convey some information about the weight 
of the sample evidence against the null hypothesis. To measure how much evidence we have 
against the null, we employ the concept of p-value, which refers to the probability under the 
null that the value of the test statistic is at least as extreme as the one we actually observe 
in the sample. Smaller p-values correspond to more conclusive sample evidence given that 
we impose the null. In other words, the p-value is the smallest significance level at which we 
would reject the null hypothesis given the observed value of the test statistic. 

Computing p-values is like taking the opposite route to the one we take to derive a rejection
region. To obtain the latter, we fix the level of significance $\alpha$ in the computation of the critical
values. To find the p-value of a one-sided test, we compute the tail probability of the test
statistic by evaluating the corresponding distribution at the sample statistic. As for two-sided
tests, we must just multiply the one-sided p-value by two if the sampling distribution is symmetric.
The main difference between the level of significance and the p-value is that the latter is a
function of the sample, whereas the former is a fixed probability that we choose ex-ante.

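For instance, letting $\bar{x}_N$ denote the observed value of the sample mean, the p-value of an
asymptotic means test is

$$\Pr\left( \sqrt{N}\, \frac{\bar{X}_N - \mu_0}{\hat{\sigma}_N} > \sqrt{N}\, \frac{\bar{x}_N - \mu_0}{\hat{\sigma}_N} \right) \cong 1 - \Phi\left( \sqrt{N}\, \frac{\bar{x}_N - \mu_0}{\hat{\sigma}_N} \right)$$

if the alternative hypothesis is $H_1: \mu > \mu_0$, whereas it is

$$\Pr\left( \sqrt{N}\, \frac{\bar{X}_N - \mu_0}{\hat{\sigma}_N} < \sqrt{N}\, \frac{\bar{x}_N - \mu_0}{\hat{\sigma}_N} \right) \cong \Phi\left( \sqrt{N}\, \frac{\bar{x}_N - \mu_0}{\hat{\sigma}_N} \right)$$

for $H_1: \mu < \mu_0$. As for two-sided tests, the p-value reads

$$\Pr\left( \left| \sqrt{N}\, \frac{\bar{X}_N - \mu_0}{\hat{\sigma}_N} \right| > \left| \sqrt{N}\, \frac{\bar{x}_N - \mu_0}{\hat{\sigma}_N} \right| \right) \cong 2 \left[ 1 - \Phi\left( \left| \sqrt{N}\, \frac{\bar{x}_N - \mu_0}{\hat{\sigma}_N} \right| \right) \right]$$

for $H_1: \mu \neq \mu_0$. To better understand how we compute p-values, let's revisit the barista
example one more time.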

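Example: Under the null hypothesis that it takes on average 28 seconds to pour
a perfect espresso, the asymptotic normal approximation for the distribution of the sample
mean implies the following p-value for a sample mean of 26 seconds:

$$\begin{aligned}
2 \Pr\left( \sqrt{16}\, \frac{\bar{X}_{16} - 28}{6} > \left| \sqrt{16}\, \frac{\bar{x}_{16} - 28}{6} \right| \,\Big|\, H_0: \mu = 28 \right)
  &\cong 2 \left[ 1 - \Phi\left( 4\, \frac{|26 - 28|}{6} \right) \right] \\
  &= 2 [1 - \Phi(4/3)] = 2 (1 - 0.90878878) \\
  &= 0.18242244.
\end{aligned}$$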



This means that we cannot reject the null hypothesis at the usual levels of significance (i.e., 
1%, 5% and 10%). We must be ready to consider a level of significance of about 18.25% if 
we really wish to reject the null. 
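
The same figure can be reproduced in a few lines. This is a sketch under the normal approximation used above; the function name `means_test_p_value` and its `alternative` argument are hypothetical conveniences rather than part of any particular library.

```python
from math import sqrt
from statistics import NormalDist


def means_test_p_value(sample_mean, mu0, sd, n, alternative="two-sided"):
    """Asymptotic p-value of a means test of H0: mu = mu0."""
    z = sqrt(n) * (sample_mean - mu0) / sd  # observed value of the test statistic
    phi = NormalDist().cdf
    if alternative == "greater":            # H1: mu > mu0
        return 1 - phi(z)
    if alternative == "less":               # H1: mu < mu0
        return phi(z)
    return 2 * (1 - phi(abs(z)))            # H1: mu != mu0


# Barista example: sample mean 26, null mean 28, standard deviation 6, N = 16.
p_two_sided = means_test_p_value(26, 28, 6, 16)
p_less = means_test_p_value(26, 28, 6, 16, alternative="less")
print(round(p_two_sided, 4), round(p_less, 4))
```

Because the normal distribution is symmetric, the two-sided p-value is exactly twice the one-sided p-value in the direction of the observed deviation.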

Before concluding this section, it is useful to talk about what a p-value is not. First,
it is not about the probability that the null hypothesis is true. We could never produce 
such a probability. We compute the p-value under the null and hence it cannot say anything 
about how likely the null hypothesis is. In addition, it does not make any sense to compute 
the probability of a hypothesis given that the latter is not a random variable. Second, a large 
p-value does not necessarily imply that the null is true. It just means that we don't have 
enough evidence to reject it. Third, the p-value does not say anything about the magnitude 
of the deviation with respect to the null hypothesis. To sum up, the p-value entails the 
confidence that we may have in the null hypothesis to explain the result we actually observe 
in the sample. 

7.4 Likelihood-based tests 

The discussion above suggests that very often there is no uniformly most powerful test
for a given set of null and alternative hypotheses. It nonetheless turns out that
likelihood-based tests are typically very powerful in a wide array of situations. In
particular, if it exists, a uniformly most powerful (unbiased) test is very often equivalent to a
likelihood-based test. This means that likelihood methods entail not only efficient estimators,
but also a framework to build satisfactory tests.

Let $\theta \in \Theta \subset \mathbb{R}^k$ denote a $k$-dimensional parameter vector of which the likelihood $\mathcal{L}(\theta; X)$
is a function. Consider the problem of testing the composite null hypothesis $H_0: \theta \in \Theta_0$
against the composite alternative hypothesis $H_1: \theta \in \Theta - \Theta_0$. We now define the likelihood
ratio as

$$\lambda(X) = \frac{\max_{\theta \in \Theta_0} \mathcal{L}(\theta; X)}{\max_{\theta \in \Theta} \mathcal{L}(\theta; X)} = \frac{\mathcal{L}(\tilde{\theta}_N; X)}{\mathcal{L}(\hat{\theta}_N; X)},$$

where $\tilde{\theta}_N$ and $\hat{\theta}_N$ are the restricted and unrestricted maximum likelihood estimators,
respectively. The restricted optimization means that we search for the parameter vector that
maximizes the log-likelihood function only within the null parameter space $\Theta_0$, whereas the
unrestricted optimization yields the usual ML estimator of $\theta$.
The intuition for a likelihood-ratio test is very simple. In the event that the null
hypothesis is true, the unrestricted optimization will (in the limit as $N \to \infty$) yield a value for
the parameter vector within $\Theta_0$ and hence the likelihood ratio will take a unit value. If
the null is false, then the unrestricted optimization will yield a value for $\theta \in \Theta - \Theta_0$ and
hence the ratio will take a value below one. This suggests a rejection region of the form
$\{X : \lambda(X) < C_{\alpha}\}$ for some constant $0 < C_{\alpha} < 1$ that depends on the significance level $\alpha$.



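Example: Let $X$ denote a random sample from a normal distribution with mean $\mu$ and
variance $\sigma^2$. Suppose that the interest lies in testing the null hypothesis $H_0: \mu = \mu_0$
against the alternative $H_1: \mu \neq \mu_0$ by means of likelihood methods. As the (unrestricted)
likelihood function is $(2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^N (X_i - \mu)^2 \right\}$, the (unrestricted) maximum
likelihood estimators for $\mu$ and $\sigma^2$ are the sample mean $\bar{X}_N$ and the sample variance
$\hat{\sigma}_N^2 = \frac{1}{N} \sum_{i=1}^N (X_i - \bar{X}_N)^2$. In contrast, confining attention to the null hypothesis yields a
restricted likelihood function of $(2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^N (X_i - \mu_0)^2 \right\}$ with restricted ML
estimators given by $\mu_0$ and $\tilde{\sigma}_N^2 = \frac{1}{N} \sum_{i=1}^N (X_i - \mu_0)^2$. It then follows that the likelihood ratio is

$$\begin{aligned}
\lambda(X) &= \frac{(2\pi\tilde{\sigma}_N^2)^{-N/2} \exp\left\{ -\frac{1}{2\tilde{\sigma}_N^2} \sum_{i=1}^N (X_i - \mu_0)^2 \right\}}{(2\pi\hat{\sigma}_N^2)^{-N/2} \exp\left\{ -\frac{1}{2\hat{\sigma}_N^2} \sum_{i=1}^N (X_i - \bar{X}_N)^2 \right\}} \\
  &= \frac{\left[ \frac{1}{N} \sum_{i=1}^N (X_i - \mu_0)^2 \right]^{-N/2} \exp(-N/2)}{\left[ \frac{1}{N} \sum_{i=1}^N (X_i - \bar{X}_N)^2 \right]^{-N/2} \exp(-N/2)} \\
  &= \left[ \frac{\sum_{i=1}^N (X_i - \mu_0)^2}{\sum_{i=1}^N (X_i - \bar{X}_N)^2} \right]^{-N/2}.
\end{aligned}$$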

To compute the critical value $k_{\alpha}$ of the rejection region, we must first derive the distribution
of $\lambda(X)$ under the null. This may look like a daunting task, but it is actually
straightforward for we can write the numerator of the fraction as

$$\sum_{i=1}^N (X_i - \mu_0)^2 = \sum_{i=1}^N (X_i - \bar{X}_N + \bar{X}_N - \mu_0)^2 = \sum_{i=1}^N (X_i - \bar{X}_N)^2 + N (\bar{X}_N - \mu_0)^2,$$

which implies that

$$\lambda(X) = \left[ 1 + \frac{N (\bar{X}_N - \mu_0)^2}{\sum_{i=1}^N (X_i - \bar{X}_N)^2} \right]^{-N/2}.$$

Now it suffices to appreciate that the likelihood ratio is a monotone decreasing function
of $|\sqrt{N} (\bar{X}_N - \mu_0)/s_N|$ given that the fraction within brackets is the square of the latter divided
by $N - 1$. It then follows from $\sqrt{N} (\bar{X}_N - \mu_0)/s_N \sim t_{N-1}$ that a rejection region of the form
$\{X : |\sqrt{N} (\bar{X}_N - \mu_0)/s_N| > t_{N-1}(1 - \alpha/2)\}$, where $t_{N-1}(1 - \alpha/2)$ is the $(1 - \alpha/2)$th percentile
of a t-student distribution with $N - 1$ degrees of freedom, yields a test with a significance
level of $\alpha$.
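
The link between the likelihood ratio and the t statistic can be verified numerically. The sketch below uses made-up pouring times (the data and the function names are hypothetical) and checks that the ratio computed directly from the two maximized likelihoods coincides with $[1 + t^2/(N-1)]^{-N/2}$.

```python
from math import exp, log, pi, sqrt


def normal_loglik(data, mu, var):
    """Gaussian log-likelihood of the sample at mean mu and variance var."""
    n = len(data)
    return -0.5 * n * log(2 * pi * var) - sum((x - mu) ** 2 for x in data) / (2 * var)


def likelihood_ratio(data, mu0):
    """Restricted over unrestricted maximized likelihood for H0: mu = mu0."""
    n = len(data)
    mean = sum(data) / n
    var_unrestricted = sum((x - mean) ** 2 for x in data) / n  # ML estimate (divides by N)
    var_restricted = sum((x - mu0) ** 2 for x in data) / n     # ML estimate with mu fixed at mu0
    return exp(normal_loglik(data, mu0, var_restricted)
               - normal_loglik(data, mean, var_unrestricted))


def t_statistic(data, mu0):
    """Usual t statistic, using the sample standard deviation with N - 1."""
    n = len(data)
    mean = sum(data) / n
    s = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return sqrt(n) * (mean - mu0) / s


data = [24.1, 27.5, 30.2, 22.8, 26.0, 29.3, 25.4, 23.9]  # hypothetical pouring times
mu0 = 28.0
lam = likelihood_ratio(data, mu0)
t = t_statistic(data, mu0)
lam_from_t = (1 + t ** 2 / (len(data) - 1)) ** (-len(data) / 2)
print(lam, lam_from_t)
```

Since the ratio depends on the data only through $|t|$, rejecting for small values of the ratio is the same as rejecting for large values of $|t|$.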

The above example shows that it is possible to compute the rejection region of a likelihood
ratio test by looking at whether it depends exclusively on a statistic with a known sampling
distribution. In general, however, it is very difficult to derive the exact sampling distribution
of the likelihood ratio and so we must employ asymptotic approximations. Assume, for
instance, that $X = (X_1, \ldots, X_N)$ is a random sample from a distribution $F_{\theta}$ and that we
wish to test $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$. The fact that the unrestricted ML estimator is
consistent under both the null and alternative hypotheses ensures that $\ln \mathcal{L}(\theta_0; X)$ admits a
Taylor expansion around $\hat{\theta}_N$:

$$\begin{aligned}
\ln \mathcal{L}(\theta_0; X) = \ln \mathcal{L}(\hat{\theta}_N; X)
  &+ \frac{\partial}{\partial\theta} \ln \mathcal{L}(\hat{\theta}_N; X) (\theta_0 - \hat{\theta}_N)
   + \frac{1}{2} \frac{\partial^2}{\partial\theta^2} \ln \mathcal{L}(\hat{\theta}_N; X) (\theta_0 - \hat{\theta}_N)^2 \\
  &+ \frac{1}{6} \frac{\partial^3}{\partial\theta^3} \ln \mathcal{L}(\theta^*; X) (\theta_0 - \hat{\theta}_N)^3,
\end{aligned}$$

where $\theta^* = \lambda \theta_0 + (1 - \lambda) \hat{\theta}_N$ for $0 < \lambda < 1$. The definition of the ML estimator is such
that the first derivative of the log-likelihood function is zero, whereas the fact that $\hat{\theta}_N$ is a
$\sqrt{N}$-consistent estimator ensures that the last term of the expansion converges to zero at
a very fast rate. It then follows that

$$-2 \ln \lambda(X) = 2 \left[ \ln \mathcal{L}(\hat{\theta}_N; X) - \ln \mathcal{L}(\theta_0; X) \right] \cong -\frac{\partial^2}{\partial\theta^2} \ln \mathcal{L}(\hat{\theta}_N; X) (\hat{\theta}_N - \theta_0)^2. \qquad (7.1)$$

Now, we know that under the null $\sqrt{N} (\hat{\theta}_N - \theta_0)$ weakly converges to a normal distribution
with mean zero and variance given by the inverse of the information matrix

$$-\lim_{N \to \infty} \frac{1}{N} \frac{\partial^2}{\partial\theta^2} \ln \mathcal{L}(\hat{\theta}_N; X).$$

This means that $LR = -2 \ln \lambda(X)$ is asymptotically chi-square with one degree of freedom
for the right-hand side of (7.1) is asymptotically the square of a standard normal variate.
This suggests that a test that rejects the null hypothesis if the likelihood ratio statistic
$LR > \chi^2_1(1 - \alpha)$, where the latter denotes the $(1 - \alpha)$th percentile of the chi-square
distribution with one degree of freedom, is asymptotically of level $\alpha$.



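Example: Let $X_i \sim$ iid Poisson($\lambda$) for $i = 1, \ldots, N$ and define the null and alternative
hypotheses as $H_0: \lambda = \lambda_0$ and $H_1: \lambda \neq \lambda_0$, respectively. The likelihood ratio statistic then is

$$LR = -2 \ln \lambda(X) = -2 \ln \left[ \frac{\exp(-N\lambda_0)\, \lambda_0^{\sum_{i=1}^N X_i}}{\exp(-N\hat{\lambda}_N)\, \hat{\lambda}_N^{\sum_{i=1}^N X_i}} \right]
   = 2N \left[ (\lambda_0 - \hat{\lambda}_N) - \hat{\lambda}_N \ln(\lambda_0 / \hat{\lambda}_N) \right] \xrightarrow{d} \chi^2_1,$$

where $\hat{\lambda}_N = \frac{1}{N} \sum_{i=1}^N X_i$ is the ML estimator of the Poisson arrival rate.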
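
As an illustration, the Poisson likelihood ratio statistic can be computed with the standard library alone. In this sketch the counts are invented, the function names are hypothetical, and the chi-square critical value with one degree of freedom is obtained from the fact that such a variate is the square of a standard normal.

```python
from math import log
from statistics import NormalDist


def poisson_lr_statistic(counts, lam0):
    """LR = 2N[(lam0 - lam_hat) - lam_hat * ln(lam0 / lam_hat)] for H0: lambda = lam0."""
    n = len(counts)
    lam_hat = sum(counts) / n  # ML estimator: the sample mean
    if lam_hat == 0:
        return 2 * n * lam0    # limit of the expression as lam_hat goes to zero
    return 2 * n * ((lam0 - lam_hat) - lam_hat * log(lam0 / lam_hat))


def chi2_1_critical(alpha):
    """(1 - alpha)th percentile of a chi-square with one degree of freedom."""
    return NormalDist().inv_cdf(1 - alpha / 2) ** 2


counts = [3, 5, 2, 4, 6, 3, 4, 5, 2, 4]  # hypothetical arrivals per period
lr = poisson_lr_statistic(counts, lam0=2.0)
critical = chi2_1_critical(0.05)
print(round(lr, 3), round(critical, 3), lr > critical)
```

The statistic is zero when the hypothesized rate equals the sample mean and grows as the two move apart, so large values speak against the null.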
We next extend this result to a more general setting as well as derive two additional
likelihood-based tests that are asymptotically equivalent to the likelihood ratio test. We
start by establishing some notation. Let $\Theta_0 = \{\theta : R(\theta) = 0,\ \theta \in \Theta\}$, where $R(\theta) = 0$
represents a system of $r$ nonlinear equations concerning $\theta$. For instance, we could think of
testing whether $\theta_1 + \theta_2 = 1$ and $\theta_3 = \ldots = \theta_k = 0$, giving way to a system of $r = k - 1$
restrictions of the form $R(\theta) = (\theta_1 + \theta_2 - 1, \theta_3, \ldots, \theta_k)' = 0$. Recall that the unrestricted
maximum likelihood estimator $\hat{\theta}_N$ is such that $\sqrt{N} (\hat{\theta}_N - \theta) \xrightarrow{d} N(0, \mathcal{I}_{\infty}^{-1}(\theta))$ and that
the score function is such that $\frac{1}{\sqrt{N}} \frac{\partial}{\partial\theta} \ln \mathcal{L}(\theta; X) \xrightarrow{d} N(0, \mathcal{I}_{\infty}(\theta))$, where $\mathcal{I}_{\infty}(\theta)$ is the
information matrix. In contrast, the restricted maximum likelihood estimator $\tilde{\theta}_N$ maximizes
the log-likelihood function subject to $R(\theta) = 0$ (and so it does not equate the score function
to zero for it has to account for the Lagrange multiplier term).

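Along the same lines as before, the likelihood ratio statistic is

$$LR = -2 \ln \lambda(X) = 2 \left[ \ln \mathcal{L}(\hat{\theta}_N; X) - \ln \mathcal{L}(\tilde{\theta}_N; X) \right]
   \cong -(\tilde{\theta}_N - \hat{\theta}_N)' \left[ \frac{\partial^2}{\partial\theta\, \partial\theta'} \ln \mathcal{L}(\hat{\theta}_N; X) \right] (\tilde{\theta}_N - \hat{\theta}_N) \qquad (7.2)$$

given that, under the null, a Taylor expansion is admissible for both estimators are consistent
and hence close to each other. Now, it is possible to show that, under the null, the asymptotic
variance of $\sqrt{N} (\hat{\theta}_N - \tilde{\theta}_N)$ is

$$\lim_{N \to \infty} \left[ -\frac{1}{N} \frac{\partial^2}{\partial\theta\, \partial\theta'} \ln \mathcal{L}(\hat{\theta}_N; X) \right]^{-1}.$$

This implies that the right-hand side of (7.2) converges in distribution to a chi-square with
$r$ degrees of freedom. To appreciate why, it suffices to observe that $\hat{\theta}_N$ and $\tilde{\theta}_N$ respectively
estimate $k$ and $k - r$ free parameters, so that their difference concerns only $r$ elements.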

Figure 7.3 shows that the likelihood ratio test gauges the difference between the criterion
function that we maximize either with or without constraints. It also illustrates two
alternative routes to assess whether the data are consistent with the constraints in the parameter
space. The first is to measure the difference between the restricted and unrestricted ML
estimators or, equivalently, to evaluate whether the unrestricted ML estimator satisfies the
restriction in the null hypothesis. This testing strategy gives way to what we call Wald tests.
The second route is to evaluate whether the score function of the constrained ML estimator
is close to zero. The motivation lies in the fact that, in the limit, it is completely costless to
impose a true null. This translates into a Lagrange multiplier in the vicinity of zero, so that
the first-order condition reduces to equating the score function to zero. Lagrange multiplier
tests then rely on measuring how different from zero the score function is when evaluated at
the constrained ML estimator.



Figure 7.3: Likelihood-based tests based on unrestricted and restricted ML estimators ($\hat{\theta}_N$
and $\tilde{\theta}_N$, respectively), with the log-likelihood $\ln \mathcal{L}(\theta, X)$ on the vertical axis and the values
$\ln \mathcal{L}(\hat{\theta}_N, X)$ and $\ln \mathcal{L}(\tilde{\theta}_N, X)$ marked on it. The likelihood ratio test measures the difference
between the constrained and unconstrained log-likelihood functions, whereas the Wald test
gauges the difference between the unrestricted and restricted ML estimators. The Lagrange
multiplier test assesses the magnitude of the constrained score function by focusing on the
slope of the green line. The zero slope of the red line reflects the fact that the unconstrained
score function is equal to zero by definition.
We first show how to compute Wald tests and then discuss Lagrange multiplier tests. As
usual, we will derive the necessary asymptotic theory by means of Taylor expansions. Wald
tests are about whether the unconstrained ML estimator meets the restrictions in the null
hypothesis and so we start with a Taylor expansion of $R(\theta)$ around $\hat{\theta}_N$, namely,

$$R(\theta) \cong R(\hat{\theta}_N) + R_{\theta}\, (\theta - \hat{\theta}_N) \quad \text{with} \quad R_{\theta} = \frac{\partial}{\partial\theta'} R(\theta).$$

It is now evident that $\sqrt{N} [R(\hat{\theta}_N) - R(\theta)]$ will converge to a multivariate normal
distribution with mean zero and covariance matrix given by $R_{\theta}\, \mathcal{I}_{\infty}^{-1}(\theta)\, R_{\theta}'$.² If the null is
true, we expect that the (unrestricted) ML estimator will approximately satisfy the system
of nonlinear restrictions in that $R(\hat{\theta}_N) \cong 0$. This suggests gauging whether the
magnitude of $R(\hat{\theta}_N)$ deviates from zero significantly as a way of testing $H_0$ against $H_1$. In
particular, we know that $\sqrt{N} R(\hat{\theta}_N) \xrightarrow{d} N(0, R_{\theta}\, \mathcal{I}_{\infty}^{-1}(\theta)\, R_{\theta}')$ under the null and hence
it suffices to take a quadratic form of $\sqrt{N} R(\hat{\theta}_N)$ normalized by its covariance matrix to
end up with an asymptotically chi-square distribution with $r$ degrees of freedom, namely,
$W = N\, R(\hat{\theta}_N)' [R_{\theta}\, \mathcal{I}_{\infty}^{-1}(\theta)\, R_{\theta}']^{-1} R(\hat{\theta}_N) \xrightarrow{d} \chi^2_r$. Note that by taking a quadratic form we
automatically prevent negative and positive deviations from zero from cancelling out. The
asymptotic Wald test then rejects the null at the $\alpha$ significance level if $W > \chi^2_r(1 - \alpha)$, where
the latter denotes the $(1 - \alpha)$th percentile of the chi-square distribution with $r$ degrees of
freedom.

Example: Let $X_i \sim$ iid $B(1, p)$ for $i = 1, \ldots, N$. Define the null and alternative hypotheses
as $H_0: p = p_0$ and $H_1: p \neq p_0$, respectively. The unconstrained maximum likelihood
estimator of $p$ is the sample mean $\hat{p}_N = \frac{1}{N} \sum_{i=1}^N X_i$, whose variance is $p(1 - p)/N$. Applying a
central limit theorem then yields

$$W = N\, \frac{(\hat{p}_N - p_0)^2}{\hat{p}_N (1 - \hat{p}_N)} \xrightarrow{d} \chi^2_1,$$

suggesting that we reject the null at the $\alpha$ significance level if $W > \chi^2_1(1 - \alpha)$.
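
A small numerical sketch of this Wald test follows. The counts (60 successes in 100 trials) and the function names are hypothetical, and the critical value again exploits the fact that a chi-square with one degree of freedom is the square of a standard normal.

```python
from statistics import NormalDist


def wald_statistic(successes, n, p0):
    """W = N (p_hat - p0)^2 / [p_hat (1 - p_hat)] for H0: p = p0 (needs 0 < p_hat < 1)."""
    p_hat = successes / n  # unrestricted ML estimator of p
    return n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))


def chi2_1_critical(alpha):
    """(1 - alpha)th percentile of a chi-square with one degree of freedom."""
    return NormalDist().inv_cdf(1 - alpha / 2) ** 2


w = wald_statistic(successes=60, n=100, p0=0.5)
reject_at_5_percent = w > chi2_1_critical(0.05)
print(round(w, 4), reject_at_5_percent)
```

Note that the variance in the denominator is evaluated at the unrestricted estimate, which is what distinguishes the Wald statistic from its Lagrange multiplier counterpart.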



2 See Footnote 1 in Section 6.1.5 for a very brief discussion about the multivariate normal distribution. 
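We now turn our attention to the Lagrange multiplier test. The score function $\frac{\partial}{\partial\theta} \ln \mathcal{L}(\theta; X)$
is on average zero for any $\theta \in \Theta$ and hence it is zero also for any $\theta \in \Theta_0$. In addition, the
variance of the score function is under the null equal to

$$\mathrm{var}\left[ \frac{\partial}{\partial\theta} \ln \mathcal{L}(\theta; X) \,\Big|\, \theta \in \Theta_0 \right] = -E\left[ \frac{\partial^2}{\partial\theta\, \partial\theta'} \ln \mathcal{L}(\theta; X) \,\Big|\, \theta \in \Theta_0 \right] = \mathcal{I}_N(\theta),$$

which in the limit coincides with the information matrix $\mathcal{I}_{\infty}(\theta)$. It thus follows that

$$LM = \frac{\partial}{\partial\theta'} \ln \mathcal{L}(\tilde{\theta}_N; X)\ \mathcal{I}_N^{-1}(\tilde{\theta}_N)\ \frac{\partial}{\partial\theta} \ln \mathcal{L}(\tilde{\theta}_N; X) \xrightarrow{d} \chi^2_r$$

and hence we must reject the null hypothesis if $LM > \chi^2_r(1 - \alpha)$ to obtain an asymptotic test
of level $\alpha$. Note that the chi-square distribution has $r$ degrees of freedom even though $\tilde{\theta}_N$
has $k - r$ free parameters. This is because the score of the $k - r$ free parameters must equate
to zero, leaving only $r$ dimensions for the score function to vary (i.e., those affected by
the restrictions).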

Example: Let's revisit the previous example in which $X = (X_1, \ldots, X_N)$ with $X_i \sim$
iid $B(1, p)$ for $i = 1, \ldots, N$. The LM test statistic for $H_0: p = p_0$ against $H_1: p \neq p_0$ then is

$$LM = N\, \frac{(\hat{p}_N - p_0)^2}{p_0 (1 - p_0)}$$

given that the score function evaluated at $p_0$ is $(\hat{p}_N - p_0)/[p_0 (1 - p_0)/N]$ and the corresponding
information matrix is $N/[p_0 (1 - p_0)]$. We would thus reject the null if $LM > \chi^2_1(1 - \alpha)$ to
obtain an asymptotic test at the $\alpha$ level of significance.

In the above example, it is evident that the Wald and LM tests are asymptotically
equivalent for the difference between their denominators shrinks to zero under the null as the
sample mean $\hat{p}_N$ converges almost surely to $p_0$. This asymptotic equivalence actually holds
in general, linking not only Wald and LM tests but also likelihood ratio tests. This should
come as no surprise given that the three statistics intuitively carry the same information,
as is easily seen in Figure 7.3.